F# task parallelism under Mono doesn't "appear" to execute in parallel

I have the following dummy code to test out the TPL in F# (Mono 4.5, Xamarin Studio, quad-core MacBook Pro).
To my surprise, all the work is done on the same thread. There is no parallelism at all.
open System
open System.Threading
open System.Threading.Tasks
let doWork (num:int) (taskId:int) : unit =
    for i in 1 .. num do
        Thread.Sleep(10)
        for j in 1 .. 1000 do
            ()
        Console.WriteLine(String.Format("Task {0} loop: {1}, thread id {2}", taskId, i, Thread.CurrentThread.ManagedThreadId))
[<EntryPoint>]
let main argv =
    let t2 = Task.Factory.StartNew(fun() -> doWork 10 2)
    //printfn "launched t2"
    Console.WriteLine("launched t2")
    let t1 = Task.Factory.StartNew(fun() -> doWork 8 1)
    Console.WriteLine("launched t1")
    let t3 = Task.Factory.StartNew(fun() -> doWork 10 3)
    Console.WriteLine("launched t3")
    let t4 = Task.Factory.StartNew(fun() -> doWork 5 4)
    Console.WriteLine("launched t4")
    Task.WaitAll(t1,t2,t3,t4)
    0 // return an integer exit code
However, if I increase the thread sleep time from 10 to 100ms, I can see a little parallelism.
What have I done wrong? What does this mean? I did consider the possibility that the CPU finished the work before the TPL could start the task on a new thread, but this doesn't make sense to me: I can make the inner dummy loop for j in 1 .. 1000 do () run 1000 times more, and the result is the same: no parallelism (with Thread.Sleep set to 10 ms).
The same code in C#, on the other hand, produces the desired result: all tasks print their messages to the window in an interleaved order (rather than sequentially).
Update:
As suggested, I changed the inner loop to do some 'actual' work, but everything still executes on a single thread.
Update 2:
I don't quite understand Luaan's comments, but I just ran a test on a friend's PC, and with the same code, parallelism works (without Thread.Sleep). It looks like something to do with Mono. But can Luaan explain again what I should expect from the TPL? If I have tasks that I want to perform in parallel, taking advantage of the multicore CPU, isn't the TPL the way to go?
Update 3:
I have tried out @FyodorSoikin's suggestion again with dummy code that won't be optimized away. Unfortunately, the workload still does not make Mono's TPL use multiple threads. Currently, the only way I can get Mono's TPL to allocate multiple threads is to force a sleep on the existing thread for more than 20ms. I am not qualified enough to assert that Mono is wrong, but I can confirm that the same code (same benchmark workload) behaves differently under Mono and Windows.

It looks like the Sleeps are ignored completely - see how the Task 2 loop is printed even before launching the next task, that's just silly - if the thread waited for 10ms, there's no way for that to happen.
I'd assume that the cause might be the timer resolution in the OS. The Sleep is far from accurate - it might very well be that Mono (or Mac OS) decides that since they can't reliably make you run again in 10ms, the best choice is to simply let you run right now. This is not how it works on Windows - there you're guaranteed to lose control as long as you don't Sleep(0); you'll always sleep at least as long as you wanted. It seems that on Mono / Mac OS, the idea is the reverse - the OS tries to let you sleep at most the amount of time you specified. If you want to sleep for less time than is the timer precision, too bad - no sleep.
But even if they are not ignored, there's still not a lot of pressure on the thread pool to give you more threads. You're only blocking for less than 100ms, for four tasks in a row - that's not nearly enough for the pool to start creating new threads to handle the requests (on MS.NET, new threads are only spun up after not having any free threads for 200ms, IIRC). You're simply not doing enough work for it to be worth spinning up new threads!
The point you might be missing is that Task.Factory.StartNew (with the default options) does not start a new thread. Instead, it schedules the task on the default task scheduler, which just puts it in the thread pool queue as a task to execute "at the earliest convenience", basically. If there's one free thread in the pool, the first task starts running there almost immediately. The second will run when another thread is free, and so on. Only if thread usage is "bad" (i.e. the threads are "blocked": they're not doing any CPU work, but they're not free either) is the thread pool going to spawn new threads.
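To make the "queue, not new thread" point concrete, here is a minimal sketch in Python, assuming `concurrent.futures.ThreadPoolExecutor` as a stand-in for the .NET thread pool (an analogy, not the TPL itself). `do_work` mimics the question's `doWork`; with a single worker, all four jobs are queued and run one after another on the same thread, and even with four workers no more than four threads ever appear.

```python
# Sketch: a pool *queues* submitted work; it does not create one thread per task.
# ThreadPoolExecutor is a Python stand-in for the .NET thread pool (analogy only).
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def do_work(num, task_id):
    """Mimics doWork from the question; returns (task_id, loop, thread_id) records."""
    records = []
    for i in range(1, num + 1):
        time.sleep(0.01)  # 10 ms, like Thread.Sleep(10)
        records.append((task_id, i, threading.get_ident()))
    return records


def run(max_workers):
    jobs = [(10, 2), (8, 1), (10, 3), (5, 4)]  # same (num, taskId) pairs as the question
    # submit() only enqueues; the pool decides which existing thread runs each job.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(do_work, num, tid) for num, tid in jobs]
        return [rec for f in futures for rec in f.result()]


if __name__ == "__main__":
    for workers in (1, 4):
        records = run(workers)
        used = {thread_id for _, _, thread_id in records}
        print(f"max_workers={workers}: {len(records)} iterations on {len(used)} thread(s)")
```

How many threads the real .NET pool uses is its own heuristic; the sketch only shows that submitting work and creating threads are separate decisions.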

If you look at the IL output from this program, you'll see that the inner loop is optimized away, because it doesn't have any side effects, and its return value is completely ignored.
To make it count, put something non-optimizable there, and also make it heavier: 1000 empty cycles is hardly noticeable compared to the cost of spinning up a new task.
For example:
open System.Diagnostics // for Debug.WriteLine
let doWork (num:int) (taskId:int) : unit =
    for i in 1 .. num do
        Thread.Sleep(10)
        for j in 1 .. 1000 do
            Debug.WriteLine("x")
        Console.WriteLine(String.Format("Task {0} loop: {1}, thread id {2}", taskId, i, Thread.CurrentThread.ManagedThreadId))
Update:
Adding a pure function, such as your fact, is no good. The compiler is perfectly able to see that fact has no side effects and that you duly ignore its return value, and therefore, it is perfectly cool to optimize it away. You need to do something that the compiler doesn't know how to optimize, such as Debug.WriteLine above.

Related

Why is this jump instruction so expensive when performing pointer chasing?

I have a program that performs pointer chasing and I'm trying to optimize the pointer chasing loop as much as possible.
I noticed that perf record detects that ~20% of execution time in function myFunction() is spent executing the jump instruction (used to exit out of the loop after a specific value has been read).
Some things to take note:
the pointer chasing path can comfortably fit in the L1 data cache
using __builtin_expect to avoid the cost of branch misprediction had no noticeable effect
perf record has the following output:
Samples: 153K of event 'cycles', 10000 Hz, Event count (approx.): 35559166926
myFunction /tmp/foobar [Percent: local hits]
Percent│ endbr64
...
80.09 │20: mov (%rdx,%rbx,1),%ebx
0.07 │ add $0x1,%rax
│ cmp $0xffffffff,%ebx
19.84 │ ↑ jne 20
...
I would expect that most of the cycles spent in this loop are used for reading the value from memory, which is confirmed by perf.
I would also expect the remaining cycles to be somewhat evenly spent executing the remaining instructions in the loop. Instead, perf is reporting that a large chunk of the remaining cycles are spent executing the jump.
I suspect that I can better understand these costs by understanding the micro-ops used to execute these instructions, but I'm a bit lost on where to start.

Remember that the cycles event has to pick an instruction to blame, even if both mov-load and the macro-fused cmp-and-branch uops are waiting for the result. It's not a matter of one or the other "costing cycles" while it's running; they're both waiting in parallel. (See Modern Microprocessors: A 90-Minute Guide! and https://agner.org/optimize/)
But when the "cycles" event counter overflows, it has to pick one specific instruction to "blame", since you're using statistical-sampling. This is where an inaccurate picture of reality has to be invented by a CPU that has hundreds of uops in flight. Often it's the one waiting for a slow input that gets blamed, I think because it's often the oldest in the ROB or RS and blocking allocation of new uops by the front-end.
The details of exactly which instruction gets picked might tell us something about the internals of the CPU, but only very indirectly. Like perhaps something to do with how it retires groups of 4(?) uops, and this loop has 3, so which uop is oldest when the perf event exception is taken.
The 4:1 split is probably significant for some reason, perhaps because 4+1 = 5 cycle latency of a load with a non-simple addressing mode. (I assume this is an Intel Sandybridge-family CPU, perhaps Skylake-derived?) Like maybe if data arrives from cache on the same cycle as the perf event overflows (and chooses to sample), the mov doesn't get the blame because it can actually execute and get out of the way?
IIRC, BeeOnRope or someone else found experimentally that Skylake CPUs would tend to let the oldest un-retired instruction retire after an exception arrives, at least if it's not a cache miss. In your case, that would be the cmp/jne at the bottom of the loop, which in program order appears before the load at the top of the next iteration.

Delphi Indy TIdHTTP.Get() high CPU load

I saw a similar question here, but it had no useful answer, so let me ask it again.
My application is using too much CPU. As a test, I picked a slow site (https://www.hao123.com/) from the top 10 slowest sites and read it with 100 threads simultaneously.
There is no response processing; one request is taking about 5 seconds, so it looks logical for me that my threads should use about 0% CPU being almost all the time waiting for a response.
procedure mth.Execute;
var
  tm1: dword;
begin
  HTTP:=TIdHTTP.Create;
  HTTP.ConnectTimeout:=60000;
  HTTP.ReadTimeout:=60000;
  ssl:=TIdSSLIOHandlerSocketOpenSSL.Create;
  HTTP.IOHandler:=ssl;
  HTTP.HandleRedirects:=true;
  HTTP.ProtocolVersion:=pv1_1;
  repeat
    sleep(5);
    If StartWork then begin
      tm1:=TimeGetTime;
      s:=HTTP.Get('https://www.hao123.com/');
      GlobalTiming:=(GlobalTiming * 9 + (TimeGetTime-tm1)) / 10;
    end;
  until Terminated;
  HTTP.Free;
  ssl.Free;
end;
The test application starts by creating the threads with StartWork=false. As long as I don't set StartWork:=true, CPU load is about 0%.
UPDATE: to answer the comments below: 100 threads running the sleep(5) loop DO NOT load the CPU.
As soon as I start readers by setting StartWork:=true, I see 10% CPU load on my 16-core Ryzen. When running on a 1-core VDS, this turns into a really painful problem.
The question is: how is a simple operation which should just wait, actually using that much CPU? How to "optimize" it?
UPDATE2:
It's hard to explain that the issue has nothing to do with sleep(5), so here are 2 more pictures:
I've replaced sleep(5) with sleep(100 + random(100)).
The picture from a 2-core VDS:

Why is Async version slower than single threaded version?

I am reading a large XML file using XmlReader and am exploring potential performance improvements via Async & pipelining. The following initial foray into the world of Async is showing that the Async version (which for all intents and purposes at this point is the equivalent of the Synchronous version) is much slower. Why would this be? All I've done is wrapped the "normal" code in an Async block and called it with Async.RunSynchronously
Code
open System
open System.IO.Compression // support assembly required + FileSystem
open System.Xml // support assembly required
let readerNormal (reader:XmlReader) =
    let temp = ResizeArray<string>()
    while reader.Read() do
        ()
    temp
let readerAsync1 (reader:XmlReader) =
    async{
        let temp = ResizeArray<string>()
        while reader.Read() do
            ()
        return temp
    }
let readerAsync2 (reader:XmlReader) =
    async{
        while reader.Read() do
            ()
    }
[<EntryPoint>]
let main argv =
    let path = @"C:\Temp\LargeTest1000.xlsx"
    use zipArchive = ZipFile.OpenRead path
    let sheetZipEntry = zipArchive.GetEntry(@"xl/worksheets/sheet1.xml")
    let stopwatch = System.Diagnostics.Stopwatch()
    stopwatch.Start()
    let sheetStream = sheetZipEntry.Open() // again
    use reader = XmlReader.Create(sheetStream)
    let temp1 = readerNormal reader
    stopwatch.Stop()
    printfn "%A" stopwatch.Elapsed
    System.GC.Collect()
    let stopwatch = System.Diagnostics.Stopwatch()
    stopwatch.Start()
    let sheetStream = sheetZipEntry.Open() // again
    use reader = XmlReader.Create(sheetStream)
    let temp1 = readerAsync1 reader |> Async.RunSynchronously
    stopwatch.Stop()
    printfn "%A" stopwatch.Elapsed
    System.GC.Collect()
    let stopwatch = System.Diagnostics.Stopwatch()
    stopwatch.Start()
    let sheetStream = sheetZipEntry.Open() // again
    use reader = XmlReader.Create(sheetStream)
    readerAsync2 reader |> Async.RunSynchronously
    stopwatch.Stop()
    printfn "%A" stopwatch.Elapsed
    printfn "DONE"
    System.Console.ReadLine() |> ignore
    0 // return an integer exit code
INFO
I am aware that the above Async code does not do any actual Async work - what I am trying to ascertain here is the overhead of simply making it Async.
I don't expect it to go faster just because I've wrapped it in an Async. My question is the opposite: why the dramatic (IMHO) slowdown.
TIMINGS
A comment below correctly pointed out that I should provide timings for datasets of various sizes, which is implicitly what led me to ask this question in the first place.
The following are some times based on small vs large datasets. While the absolute values are not too meaningful, the relativities are interesting:
30 elements (small dataset)
Normal: 00:00:00.0006994
Async1: 00:00:00.0036529
Async2: 00:00:00.0014863
(A lot slower but presumably indicative of Async setup costs - this is as expected)
1.5 million elements
Normal: 00:00:01.5749734
Async1: 00:00:03.3942754
Async2: 00:00:03.3760785
(~ 2x slower. Surprised that the difference in timing is not amortized as the dataset gets bigger. If this is the case, then pipelining/parallelization can only improve performance here if you have more than two cores - to outweigh the overhead that I can't explain...)

There's no asynchronous work to do. In effect, all you get is the overheads and no benefits. async {} doesn't mean "everything in the braces suddenly becomes asynchronous". It simply means you have a simplified way of using asynchronous code - but you never call a single asynchronous function!
Additionally, "asynchronous" doesn't necessarily mean "parallel", and it doesn't necessarily involve multiple threads. For example, when you do an asynchronous request to read a file (which you're not doing here), it means that the OS is told what you want to be done, and how you should be notified when it is done. When you run code like this using RunSynchronously, you're simply blocking one thread while posting asynchronous file requests - a scenario pretty much identical to using synchronous file requests in the first place.
The moment you do RunSynchronously, you throw away any reason whatsoever to use asynchronous code in the first place. You're still using a single thread, you just blocked another thread at the same time - instead of saving on threads, you waste one, and add another to do the real work.
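The same effect is easy to reproduce outside F#. This is a sketch in Python's asyncio, used here as an analogy (it is not how F# async is implemented): two coroutines whose loops contain no `await` never yield to the event loop, so even when started "together" they run strictly one after the other, and `asyncio.run` blocks the caller much like `Async.RunSynchronously` does.

```python
# Sketch: marking code "async" does not make it asynchronous. With no await inside
# the loops, neither coroutine ever yields, so they run strictly one after the other.
import asyncio

log = []


async def busy(name, n):
    for i in range(n):
        log.append((name, i))  # synchronous work only; no await, no yield point
    return name


async def main():
    # gather() schedules both coroutines, but B cannot start until A finishes.
    return await asyncio.gather(busy("A", 3), busy("B", 3))


if __name__ == "__main__":
    # asyncio.run blocks the calling thread until main() completes.
    print(asyncio.run(main()))
    print(log)
```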
EDIT:
Okay, I've investigated with the minimal example, and I've got some observations.
The difference is absolutely brutal with a profiler on - the non-async version is somewhat slower (up to 2x), but the async version is just never ending. It seems as if a huge number of allocations is going on - and yet, when I break the profiler, I can see that the non-async version (running in 4 seconds) makes a hundred thousand allocations (~20 MiB), while the async version (running over 10 minutes) only makes mere thousands. Maybe the memory profiler interacts badly with F# async? The CPU time profiler doesn't have this problem.
The generated IL is very different for the two cases. Most importantly, even though our async code doesn't actually do anything asynchronous, it creates a ton of async builder helpers, sprinkles a ton of (asynchronous) Delay calls through the code, and going into outright absurd territory, each iteration of the loop is an extra method call, including the setup of a helper object.
Apparently, F# automatically translates while into an asynchronous while. Now, given how well compressed xlsx data usually is, very little I/O is involved in those Read operations, so the overhead absolutely dominates - and since every iteration of the "loop" has its own setup cost, the overhead scales with the amount of data.
While this is mostly caused by the while not actually doing anything, it also obviously means that you need to be careful about what you select as async, and you need to avoid using it in a case where CPU time dominates (as in this case - after all, both the async and non-async case are almost 100% CPU tasks in practice). This is further worsened by the fact that Read reads a single node at a time - something that's relatively trivial even in a big, non-compressed xml file. The overheads absolutely dominate. In effect, this is analogous to using Parallel.For with a body like sum += i - the setup cost of each of the iterations absolutely dwarfs any actual work being done.
The CPU profiling makes this rather obvious - the two most work intensive methods are:
XmlReader.Read (expected)
Thread::intermediateThreadProc - also known as "this code runs on a thread pool thread". The overhead from this in a no-op code like this is around 100% - yikes. Apparently, even though there is no real asynchronicity anywhere, the callbacks are never run synchronously. Every iteration of the loop posts work to a new thread pool thread.
The lesson learned? Probably something like "don't use loops in async if the loop body does very little work". The overhead is incurred for each and every iteration of the loop. Ouch.
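The per-iteration cost can be modelled with a small experiment. This sketch uses Python's asyncio and forces a trip through the event loop on every iteration with `await asyncio.sleep(0)`; that is an assumed stand-in for the per-iteration async machinery described above, not a reproduction of the F# builder. Both loops compute the same sum; only the scheduling overhead differs, and the measured ratio will vary by machine.

```python
# Sketch: a loop that hands control to the scheduler on every iteration pays a
# fixed cost per iteration, so the overhead grows with the number of iterations.
import asyncio
import time


def plain_loop(n):
    total = 0
    for i in range(n):
        total += i
    return total


async def yielding_loop(n):
    total = 0
    for i in range(n):
        total += i
        await asyncio.sleep(0)  # go back through the event loop on every iteration
    return total


def timed(fn, *args):
    start = time.perf_counter()
    value = fn(*args)
    return value, time.perf_counter() - start


if __name__ == "__main__":
    n = 200_000
    v1, t1 = timed(plain_loop, n)
    v2, t2 = timed(lambda k: asyncio.run(yielding_loop(k)), n)
    assert v1 == v2
    print(f"plain: {t1:.3f}s  yielding: {t2:.3f}s")
```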

Asynchronous code doesn't magically make your code faster. As you've discovered, it'll tend to make isolated code slower, because there's overhead involved with managing the asynchrony.
What it can do is to be more efficient, but that's not the same as being inherently faster. The main purpose of Async is to make Input/Output code more efficient.
If you invoke a 'slow', blocking I/O operation directly, you'll block the thread until the operation returns.
If you instead invoke that slow operation asynchronously, it may free up the thread to do other things. It does require that there's an underlying implementation that's not thread-bound, but uses another mechanism for receiving the response. I/O Completion Ports could be such a mechanism.
Now, if you run a lot of asynchronous code in parallel, it may turn out to be faster than attempting to run the blocking implementation in parallel, because the async versions use fewer resources (fewer threads = less memory).
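A minimal sketch of that resource argument in Python's asyncio, assuming `asyncio.sleep` as a placeholder for a genuinely non-blocking I/O call (no real file or socket is read): 100 simulated waits of 0.1 s overlap on a single thread, so they finish in roughly the time of one wait instead of about 10 s.

```python
# Sketch: many simulated I/O waits overlap on ONE thread. asyncio.sleep stands in
# for a non-blocking I/O call (hypothetical workload, not a real file or socket read).
import asyncio
import threading
import time


async def fake_io(i, delay=0.1):
    await asyncio.sleep(delay)  # the thread is free to run other coroutines meanwhile
    return i, threading.get_ident()


async def many(n):
    return await asyncio.gather(*(fake_io(i) for i in range(n)))


def run(n):
    start = time.perf_counter()
    results = asyncio.run(many(n))
    return results, time.perf_counter() - start


if __name__ == "__main__":
    results, elapsed = run(100)
    threads = {tid for _, tid in results}
    # 100 sequential 0.1 s waits would take ~10 s; overlapped they take roughly 0.1 s.
    print(f"{len(results)} operations, {len(threads)} thread(s), {elapsed:.2f}s")
```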

Main loop in event-driven programming and alternatives

To the best of my knowledge, event-driven programs require a main loop such as
while (1) {
}
I am just curious if this while loop can cost a high CPU usage? Is there any other way to implement event-driven programs without using the main loop?

Your example is misleading. Usually, an event loop looks something like this:
Event e;
while ((e = get_next_event()) != E_QUIT)
{
    handle(e);
}
The crucial point is that the function call to our fictitious get_next_event() pumping function will be generous and encourage a context switch or whatever scheduling semantics apply to your platform, and if there are no events, the function would probably allow the entire process to sleep until an event arrives.
So in practice there's nothing to worry about, and no, there's not really any alternative to an unbounded loop if you want to process an unbounded amount of information during your program's runtime.
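Here is a sketch of that loop in Python, assuming `queue.Queue.get()` as the blocking "pumping" call: it sleeps until an event arrives, so an idle loop costs no CPU. `get_next_event` and the `QUIT` sentinel mirror the fictitious names above and are hypothetical.

```python
# Sketch of the loop above: get_next_event() blocks (no busy-waiting) until an
# event arrives. queue.Queue.get() does the sleeping; the QUIT sentinel plays E_QUIT.
import queue
import threading

QUIT = object()


def get_next_event(events):
    return events.get()  # blocks; an idle loop consumes no CPU here


def run_event_loop(events, handle):
    while True:
        e = get_next_event(events)
        if e is QUIT:
            break
        handle(e)


if __name__ == "__main__":
    events = queue.Queue()
    handled = []

    def producer():
        for name in ("click", "keypress", "timer"):
            events.put(name)
        events.put(QUIT)

    threading.Thread(target=producer).start()
    run_event_loop(events, handled.append)
    print(handled)
```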

Usually, the problem with a loop like this is that while it's doing one piece of work, it can't be doing anything else (e.g. Windows SDK's old 'cooperative' multitasking). The next naive jump up from this is generally to spawn a thread for each piece of work, but that's incredibly dangerous. Most people would end up with an executor that generally has a thread pool inside. Then, the handle call is actually just enqueueing the work and the next available thread dequeues it and executes it. The number of concurrent threads remains fixed as the total number of worker threads in the pool and when threads don't have anything to do, they are not eating CPU.
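The executor described above can be sketched in a few dozen lines of Python. `FixedPool` is a hypothetical name (the standard library's `concurrent.futures.ThreadPoolExecutor` does the same job); the point is that `submit` only enqueues, a fixed number of workers dequeue, and idle workers block in `get()` instead of burning CPU.

```python
# Sketch of an executor with a fixed-size pool: handle() only enqueues work, and
# a fixed number of worker threads dequeue and run it. FixedPool is a hypothetical
# name; concurrent.futures.ThreadPoolExecutor is the standard-library equivalent.
import queue
import threading
import time


class FixedPool:
    def __init__(self, workers):
        self.jobs = queue.Queue()
        self.lock = threading.Lock()
        self.active = 0  # jobs running right now
        self.peak = 0    # highest concurrency observed
        self.threads = [threading.Thread(target=self._worker) for _ in range(workers)]
        for t in self.threads:
            t.start()

    def submit(self, fn, *args):
        self.jobs.put((fn, args))  # returns immediately; no new thread is created

    def _worker(self):
        while True:
            item = self.jobs.get()  # idle workers block here instead of spinning
            if item is None:        # shutdown sentinel
                return
            fn, args = item
            with self.lock:
                self.active += 1
                self.peak = max(self.peak, self.active)
            try:
                fn(*args)
            except Exception as exc:  # keep the worker alive; a real pool would report this
                print("job failed:", exc)
            finally:
                with self.lock:
                    self.active -= 1

    def shutdown(self):
        # One sentinel per worker; FIFO order means queued jobs finish first.
        for _ in self.threads:
            self.jobs.put(None)
        for t in self.threads:
            t.join()


if __name__ == "__main__":
    results = []
    pool = FixedPool(3)
    for i in range(10):
        pool.submit(lambda k: (time.sleep(0.01), results.append(k * k)), i)
    pool.shutdown()
    print(sorted(results), "peak concurrency:", pool.peak)
```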

Asynchronous vs synchronous execution. What is the difference? [closed]

What is the difference between asynchronous and synchronous execution?

When you execute something synchronously, you wait for it to finish before moving on to another task. When you execute something asynchronously, you can move on to another task before it finishes.
In the context of operating systems, this corresponds to executing a process or task on a "thread." A thread is a series of commands (a block of code) that exist as a unit of work. The operating system runs a given thread on a processor core. However, a processor core can only execute a single thread at once (setting aside hardware multithreading such as Hyper-Threading). It has no concept of running multiple threads simultaneously. The operating system can provide the illusion of running multiple threads at once by running each thread for a small slice of time (such as 1ms), and continuously switching between threads.
Now, if you introduce multiple processor cores into the mix, then threads CAN execute at the same time. The operating system can allocate time to one thread on the first processor core, then allocate the same block of time to another thread on a different processor core. All of this is about allowing the operating system to manage the completion of your task while you can go on in your code and do other things.
Asynchronous programming is a complicated topic because of the semantics of how things tie together when you can do them at the same time. There are numerous articles and books on the subject; have a look!

Synchronous/Asynchronous HAS NOTHING TO DO WITH MULTI-THREADING.
Synchronous or Synchronized means "connected", or "dependent" in some way. In other words, two synchronous tasks must be aware of one another, and one task must execute in some way that is dependent on the other, such as wait to start until the other task has completed.
Asynchronous means they are totally independent and neither one must consider the other in any way, either in the initiation or in execution.
Synchronous (one thread):
1 thread -> |<---A---->||<----B---------->||<------C----->|
Synchronous (multi-threaded):
thread A -> |<---A---->|
                        \
thread B ------------>   ->|<----B---------->|
                                              \
thread C ----------------------------------->   ->|<------C----->|
Asynchronous (one thread):
        A-Start ------------------------------------------- A-End
           | B-Start -----------------------------------------|--- B-End
           |    |    C-Start -------------------- C-End       |      |
           |    |       |                           |         |      |
           V    V       V                           V         V      V
1 thread->|<-A-|<--B---|<-C-|-A-|-C-|--A--|-B-|--C-->|---A---->|--B-->|
Asynchronous (multi-Threaded):
thread A ->     |<---A---->|
thread B ----->     |<----B---------->|
thread C --------->     |<------C--------->|
Start and end points of tasks A, B, C represented by <, > characters.
CPU time slices represented by vertical bars |
Technically, the concept of synchronous/asynchronous really does not have anything to do with threads. Although, in general, it is unusual to find asynchronous tasks running on the same thread, it is possible, (see below for examples) and it is common to find two or more tasks executing synchronously on separate threads... No, the concept of synchronous/asynchronous has to do solely with whether or not a second or subsequent task can be initiated before the other (first) task has completed, or whether it must wait. That is all. What thread (or threads), or processes, or CPUs, or indeed, what hardware, the task[s] are executed on is not relevant. Indeed, to make this point I have edited the graphics to show this.
ASYNCHRONOUS EXAMPLE:
In solving many engineering problems, the software is designed to split up the overall problem into multiple individual tasks and then execute them asynchronously. Inverting a matrix, or a finite element analysis problem, are good examples. In computing, sorting a list is an example. The quicksort routine, for example, splits the list into two lists and performs a quicksort on each of them, calling itself (quicksort) recursively. In both of the above examples, the two tasks can be (and often are) executed asynchronously. They do not need to be on separate threads. Even a machine with one CPU and only one thread of execution can be coded to initiate processing of a second task before the first one has completed. The only criterion is that the results of one task are not necessary as inputs to the other task. As long as the start and end times of the tasks overlap (possible only if the output of neither is needed as input to the other), they are being executed asynchronously, no matter how many threads are in use.
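The quicksort case can be sketched in Python as follows. As a simplifying assumption, the list is split only once at the top level and each half is submitted as its own task with `concurrent.futures`; the helper names are illustrative. Because of CPython's GIL, this demonstrates two independent tasks being in flight at once, not a speed-up for CPU-bound sorting.

```python
# Sketch: quicksort's two sub-sorts are independent, so they can be started
# asynchronously. The split happens once at the top level and each half is
# submitted as its own task. In CPython the GIL means this shows overlap of
# independent tasks, not a speed-up for CPU-bound sorting.
import random
from concurrent.futures import ThreadPoolExecutor


def partition(xs):
    pivot = xs[len(xs) // 2]
    return ([x for x in xs if x < pivot],
            [x for x in xs if x == pivot],
            [x for x in xs if x > pivot])


def quicksort(xs):
    if len(xs) <= 1:
        return list(xs)
    left, middle, right = partition(xs)
    return quicksort(left) + middle + quicksort(right)


def quicksort_async_top(xs):
    if len(xs) <= 1:
        return list(xs)
    left, middle, right = partition(xs)
    with ThreadPoolExecutor(max_workers=2) as pool:
        # Neither half needs the other's output, so both may be in flight at once.
        left_future = pool.submit(quicksort, left)
        right_future = pool.submit(quicksort, right)
        return left_future.result() + middle + right_future.result()


if __name__ == "__main__":
    data = [random.randint(0, 999) for _ in range(10_000)]
    assert quicksort_async_top(data) == sorted(data)
    print("sorted", len(data), "items")
```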
SYNCHRONOUS EXAMPLE:
Any process consisting of multiple tasks where the tasks must be executed in sequence, but one must be executed on another machine (Fetch and/or update data, get a stock quote from financial service, etc.). If it's on a separate machine it is on a separate thread, whether synchronous or asynchronous.

In simpler terms:
SYNCHRONOUS
You are in a queue to get a movie ticket. You cannot get one until everybody in front of you gets one, and the same applies to the people queued behind you.
ASYNCHRONOUS
You are in a restaurant with many other people. You order your food. Other people can also order their food, they don't have to wait for your food to be cooked and served to you before they can order.
In the kitchen restaurant workers are continuously cooking, serving, and taking orders.
People will get their food served as soon as it is cooked.

Simple Explanation via analogy
(A story to help you remember.)
Synchronous Execution
My boss is a busy man. He tells me to write code. I tell him: Fine. I get started and he's watching me like a vulture, standing behind me, off my shoulder. I'm like "Dude, WTF: why don't you go and do something while I finish this?"
he's like: "No, I'm waiting right here until you finish." This is synchronous.
Asynchronous Execution
The boss tells me to do it, and rather than waiting right there for my work, the boss goes off and does other tasks. When I finish my job I simply report to my boss and say: "I'm DONE!" This is Asynchronous Execution.
(Take my advice: NEVER work with the boss behind you.)

Synchronous execution means the execution happens in a single series. A->B->C->D. If you are calling those routines, A will run, then finish, then B will start, then finish, then C will start, etc.
With Asynchronous execution, you begin a routine, and let it run in the background while you start your next, then at some point, say "wait for this to finish". It's more like:
Start A->B->C->D->Wait for A to finish
The advantage is that you can execute B, C, and or D while A is still running (in the background, on a separate thread), so you can take better advantage of your resources and have fewer "hangs" or "waits".
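"Start A, run B, C, D, then wait for A" maps directly onto a future. Here is a minimal Python sketch using `concurrent.futures`; `task` and its sleep durations are hypothetical stand-ins for real work.

```python
# Sketch of "Start A -> B -> C -> D -> wait for A" using a background worker.
# task() and its sleep durations are hypothetical stand-ins for real work.
import time
from concurrent.futures import ThreadPoolExecutor


def task(name, seconds, log):
    time.sleep(seconds)
    log.append(name)
    return name


def run():
    log = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        future_a = pool.submit(task, "A", 0.3, log)  # start A in the background
        for name in ("B", "C", "D"):
            task(name, 0.01, log)  # meanwhile B, C, D run on the calling thread
        result_a = future_a.result()  # "wait for A to finish"
    return result_a, log


if __name__ == "__main__":
    start = time.perf_counter()
    result, log = run()
    print(result, log, f"{time.perf_counter() - start:.2f}s")
```

The total time is roughly the longer of A and B+C+D rather than their sum, which is the "better use of resources" the answer describes.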

In a nutshell, synchronization refers to two or more processes' start and end points, NOT their executions. In this example, Process A's endpoint is synchronized with Process B's start point:
SYNCHRONOUS
|--------A--------|
                  |--------B--------|
Asynchronous processes, on the other hand, do not have their start and end points synchronized:
ASYNCHRONOUS
|--------A--------|
         |--------B--------|
Where Process A overlaps Process B, they're running concurrently or synchronously (dictionary definition), hence the confusion.
UPDATE: Charles Bretana improved his answer, so this answer is now just a simple (potentially oversimplified) mnemonic.

Synchronous means that the caller waits for the response or completion, asynchronous that the caller continues and a response comes later (if applicable).
As an example:
static void Main(string[] args)
{
    Console.WriteLine("Before call");
    doSomething();
    Console.WriteLine("After call");
}
private static void doSomething()
{
    Console.WriteLine("In call");
}
This will always output:
Before call
In call
After call
But if we were to make doSomething() asynchronous (multiple ways to do it), then the output could become:
Before call
After call
In call
Because the method making the asynchronous call would immediately continue with the next line of code. I say "could", because order of execution can't be guaranteed with async operations. It could also execute as the original, depending on thread timings, etc.
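One way to make `doSomething()` asynchronous, sketched with Python's asyncio instead of C# (an assumed analogy; the answer does not name a mechanism): `create_task` schedules the call and returns immediately, so "After call" is recorded before "In call". With a single-threaded event loop this particular order is deterministic; with real threads it is not, as the answer notes.

```python
# Sketch of the same three prints with doSomething() made asynchronous, using
# Python's asyncio in place of C# (an analogy; the answer does not name a mechanism).
import asyncio


async def do_something(out):
    out.append("In call")


async def main():
    out = []
    out.append("Before call")
    task = asyncio.create_task(do_something(out))  # scheduled, not yet running
    out.append("After call")
    await task  # main yields here, so the scheduled task finally runs
    return out


if __name__ == "__main__":
    for line in asyncio.run(main()):
        print(line)
```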

Sync vs Async
Sync and async operations are about the execution order of the next task relative to the current task.
Let's take a look at an example where Task 2 is the current task and Task 3 is the next task. A task is an atomic operation - a method call on the stack (a method frame).
Synchronous
Implies that tasks will be executed one by one. A next task is started only after current task is finished. Task 3 is not started until Task 2 is finished.
Single Thread + Sync - Sequential
Usual execution.
Pseudocode:
main() {
    task1()
    task2()
    task3()
}
Multi Thread + Sync - Parallel
Blocked.
Blocked means that a thread is just waiting (although it could do something useful), e.g. with Java's ExecutorService and Future. Pseudocode:
main() {
    task1()
    Future future = executor.submit(() -> task2())
    future.get() //<- blocked operation
    task3()
}
Asynchronous
Implies that a task returns control immediately with a promise to execute the code and notify about the result later (e.g. a callback or a future). Task 3 is executed even if Task 2 is not finished. Related terms: async callback, completion handler.
Single Thread + Async - Concurrent
Callback Queue (Message Queue) and Event Loop (Run Loop, Looper) are used. Event Loop checks if Thread Stack is empty and if it is true it pushes first item from the Callback Queue into Thread Stack and repeats these steps again. Simple examples are button click, post event...
Pseudocode:
main() {
    task1()
    ThreadMain.handler.post(() -> task2());
    task3()
}
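The callback queue plus event loop can be modelled in a few lines of Python. `RunLoop` and `post` are hypothetical names standing in for `handler.post(...)`: posting returns at once, and the loop runs queued callbacks one at a time on the same thread, so task2 runs after task3 without any extra thread.

```python
# Toy single-threaded run loop: post() appends a callback to a queue and returns
# immediately; run() executes queued callbacks one at a time on the same thread.
# RunLoop and post() are hypothetical names standing in for handler.post(...).
from collections import deque


class RunLoop:
    def __init__(self):
        self.callbacks = deque()

    def post(self, fn):
        self.callbacks.append(fn)  # returns at once; fn runs later

    def run(self):
        while self.callbacks:  # current callback done -> take the next one
            self.callbacks.popleft()()


def demo():
    trace = []
    loop = RunLoop()

    def main():
        trace.append("task1")
        loop.post(lambda: trace.append("task2"))
        trace.append("task3")

    loop.post(main)
    loop.run()
    return trace


if __name__ == "__main__":
    print(demo())
```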
Multi Thread + Async - Concurrent and Parallel
Non-blocking.
For example when you need to make some calculations on another thread without blocking. Pseudocode:
main() {
    task1()
    new Thread(() -> task2()).start();
    //or
    Future future = executor.submit(() -> task2())
    task3()
}
You can use the result of Task 2 either through the blocking get() method or through an async callback via a loop.
For example in Mobile world where we have UI/main thread and we need to download something we have several options:
sync block - block UI thread and wait when downloading is done. UI is not responsive.
async callback - create a new thread with an async callback to update the UI (it is not possible to access the UI from a non-UI thread). Callback hell.
async coroutine - an async task with sync syntax. It allows mixing a downloading task (suspend function) with a UI task.

I think this is a bit of a roundabout explanation, but it still clarifies things with a real-life example.
Small Example:
Let's say playing an audio involves three steps:
Getting the compressed song from the hard disk
Decompress the audio.
Play the uncompressed audio.
If your audio player does steps 1, 2, 3 sequentially for every song, then it is synchronous. You will have to wait some time to hear each song, until it actually gets fetched and decompressed.
If your audio player does steps 1, 2, 3 independently of each other, then it is asynchronous, i.e.
while playing audio 1 (step 3), it fetches audio 3 from the hard disk in parallel (step 1) and decompresses audio 2 in parallel (step 2).
You will end up hearing the songs without waiting much for fetching and decompression.
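The asynchronous player can be sketched as a three-stage pipeline in Python, with each stage on its own thread and queues between them, so song N can play while song N+1 is decompressed and song N+2 is fetched. `fetch` and `decompress` are hypothetical placeholders that only transform strings.

```python
# Sketch of the pipelined player: fetch, decompress, and play run as separate
# stages joined by queues, so song N can play while song N+1 decompresses and
# song N+2 is fetched. fetch()/decompress() are hypothetical placeholders.
import queue
import threading

DONE = object()  # end-of-stream marker passed down the pipeline


def fetch(song):
    return f"compressed({song})"


def decompress(data):
    return data.replace("compressed", "pcm")


def stage(work, inbox, outbox):
    while True:
        item = inbox.get()
        if item is DONE:
            outbox.put(DONE)
            return
        outbox.put(work(item))


def run_pipeline(songs):
    to_fetch, to_decompress, to_play = queue.Queue(), queue.Queue(), queue.Queue()
    workers = [
        threading.Thread(target=stage, args=(fetch, to_fetch, to_decompress)),
        threading.Thread(target=stage, args=(decompress, to_decompress, to_play)),
    ]
    for w in workers:
        w.start()
    for song in songs:
        to_fetch.put(song)
    to_fetch.put(DONE)
    played = []
    while True:
        item = to_play.get()
        if item is DONE:
            break
        played.append(item)  # the "play" step runs on the main thread
    for w in workers:
        w.join()
    return played


if __name__ == "__main__":
    print(run_pipeline(["song1", "song2", "song3"]))
```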

I created a GIF to explain this; I hope it helps.
Look: line 3 is asynchronous and the others are synchronous.
Every line before line 3 has to wait until the previous line finishes its work, but because line 3 is asynchronous, the next line (line 4) doesn't wait for line 3. Line 5, however, has to wait for line 4 to finish its work, line 6 for line 5, and line 7 for line 6, because lines 4, 5, 6 and 7 are not asynchronous.

Simply put, asynchronous execution is doing stuff in the background.
For example, if you want to download a file from the internet, you might use a synchronous function to do that, but it will block your thread until the file has finished downloading. This can make your application unresponsive to user input.
Instead, you could download the file in the background using an asynchronous method. In this case, the download function returns immediately and program execution continues normally. The download is done in the background, and your program is notified when it's finished.
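One way to express "returns immediately, notified when finished" in F# is `Async.StartWithContinuations`, whose success, exception and cancellation continuations act as the notification. In this sketch `download` is hypothetical (a 300 ms wait, no real network), and the caller's "other work" is just counting.

```fsharp
open System.Threading

// Hypothetical download: waits 300 ms without blocking a thread, then describes the result.
let download (url: string) : Async<string> =
    async {
        do! Async.Sleep 300
        return sprintf "downloaded %s" url
    }

let result = ref ""
let finished = new ManualResetEventSlim(false)

// Starts the download and returns as soon as the workflow reaches its first wait.
// The continuations are how the program is notified when the download ends.
let startDownload (url: string) : unit =
    Async.StartWithContinuations(
        download url,
        (fun text -> result.Value <- text; finished.Set()),                          // success
        (fun (ex: exn) -> result.Value <- "failed: " + ex.Message; finished.Set()),  // error
        (fun _ -> finished.Set()))                                                    // cancelled

// Meanwhile the caller keeps doing other work (here: counting) until it is notified.
let keepBusyUntilNotified () : int =
    let mutable ticks = 0
    while not finished.IsSet do
        ticks <- ticks + 1
        Thread.Sleep 50
    ticks
```

Calling `startDownload` and then `keepBusyUntilNotified` shows the caller getting several ticks of work done before `result` holds the download.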

As a really simple example,
SYNCHRONOUS
Imagine 3 school students instructed to run a relay race on a road.
The 1st student runs her given distance, stops and passes the baton to the 2nd. No one else has started to run.
1------>
2.
3.
When the 2nd student retrieves the baton, she starts to run her given distance.
1.
2------>
3.
The 2nd student's shoelace comes untied. She stops to tie it again. Because of this, the 2nd student's end time is extended and the 3rd's start time is delayed.
1.
--2.--->
3.
This pattern continues until the 3rd retrieves the baton from the 2nd and finishes the race.
ASYNCHRONOUS
Just imagine 10 random people walking on the same road.
They're not in a queue, of course, just walking randomly at different places on the road, at different paces.
The 2nd person's shoelace comes untied. She stops to tie it again.
But nobody waits for her to tie it. Everyone else keeps walking the same way as before, at their own pace.
10--> 9-->
8--> 7--> 6-->
5--> 4-->
1--> 2. 3-->
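The two pictures can be simulated in F#. The delays are made up (100 ms per person, 400 ms for whoever stops to tie a shoelace); the relay runs the runners one after another, while the walkers all start together through `Async.Parallel`.

```fsharp
// Synchronous relay: each runner starts only when the previous one has finished.
let relay () : Async<int list> =
    async {
        let finishOrder = ResizeArray<int>()
        for runner in 1 .. 3 do
            // runner 2 stops to tie her shoelace, which also delays runner 3's start
            do! Async.Sleep(if runner = 2 then 400 else 100)
            finishOrder.Add runner
        return List.ofSeq finishOrder
    }

// Asynchronous walkers: all ten start together; walker 2's delay holds no one else up.
let walkers () : Async<int list> =
    async {
        let finishOrder = ResizeArray<int>()
        let walk (who: int) =
            async {
                do! Async.Sleep(if who = 2 then 400 else 100)
                lock finishOrder (fun () -> finishOrder.Add who) // thread-safe append
            }
        let! _ = Async.Parallel([ for who in 1 .. 10 -> walk who ])
        return List.ofSeq finishOrder
    }
```

In the relay, runner 3 cannot finish before about 600 ms because runner 2's stop pushes her start back; among the walkers, only walker 2 is late, and she arrives last while everyone else is already done after about 100 ms.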

Synchronous basically means that you can only execute one thing at a time. Asynchronous means that you can execute multiple things at a time, and you don't have to finish executing the current thing in order to move on to the next one.

When executing a sequence like: a>b>c>d>, if we get a failure in the middle of execution like:
a
b
c
fail
Then we re-start from the beginning:
a
b
c
d
this is synchronous
If, however, we have the same sequence to execute: a>b>c>d>, and we have a failure in the middle:
a
b
c
fail
...but instead of restarting from the beginning, we re-start from the point of failure:
c
d
...this is known as asynchronous.

An example of instructions for making breakfast:
Pour a cup of coffee.
Heat a pan, then fry two eggs.
Fry three slices of bacon.
Toast two pieces of bread.
Add butter and jam to the toast.
Pour a glass of orange juice.
If you have experience with cooking, you'd execute those instructions asynchronously. You'd start warming the pan for eggs, then start the bacon. You'd put the bread in the toaster, then start the eggs. At each step of the process, you'd start a task, then turn your attention to tasks that are ready for your attention.
Cooking breakfast is a good example of asynchronous work that isn't parallel. One person (or thread) can handle all these tasks. Continuing the breakfast analogy, one person can make breakfast asynchronously by starting the next task before the first task completes. The cooking progresses whether or not someone is watching it. As soon as you start warming the pan for the eggs, you can begin frying the bacon. Once the bacon starts, you can put the bread into the toaster.
For a parallel algorithm, you'd need multiple cooks (or threads). One would make the eggs, one the bacon, and so on. Each one would be focused on just that one task. Each cook (or thread) would be blocked synchronously waiting for the bacon to be ready to flip, or the toast to pop.
(emphasis mine)
From Asynchronous programming concepts
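An F# sketch of the breakfast, assuming a hypothetical `step` helper that stands in for each cooking step with a made-up delay. Eggs, bacon and toast are started with `Async.StartChild` and collected when needed, so butter and jam only wait for the toast. One caveat against the quote: F# async continuations resume on thread-pool threads, so this is not literally a single thread; the point it shows is that nothing sits blocked while the food cooks.

```fsharp
open System.Diagnostics

// Hypothetical cooking step: waits (pan heating, toaster running) without blocking a thread.
let step (name: string) (ms: int) : Async<string> =
    async {
        do! Async.Sleep ms
        return name
    }

let makeBreakfast () : Async<string list> =
    async {
        let! coffee = step "coffee" 50
        // Start eggs, bacon and toast, then turn to whatever needs attention next.
        let! eggs = Async.StartChild(step "eggs" 300)
        let! bacon = Async.StartChild(step "bacon" 300)
        let! toast = Async.StartChild(step "toast" 200)
        let! toastDone = toast
        let! jam = step "butter and jam" 50 // needs the toast, not the eggs or the bacon
        let! eggsDone = eggs
        let! baconDone = bacon
        let! juice = step "orange juice" 50
        return [ coffee; eggsDone; baconDone; toastDone; jam; juice ]
    }

// Runs a workflow to completion and reports how long it took, in milliseconds.
let timed (work: Async<'T>) : 'T * int64 =
    let sw = Stopwatch.StartNew()
    let value = Async.RunSynchronously work
    value, sw.ElapsedMilliseconds
```

With these delays the steps add up to 950 ms if done one by one, while `timed (makeBreakfast ())` finishes in roughly 400 ms because the eggs, bacon and toast cook at the same time.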

A synchronous operation does its work before returning to the caller.
An asynchronous operation does (most or all of) its work after returning to the caller.
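The definition translates directly into code. In this F# sketch the "work" is a hypothetical 100 ms sleep that sets a flag: the synchronous version sets it before returning, while the asynchronous version returns a `Task` at once and sets the flag afterwards.

```fsharp
open System.Threading
open System.Threading.Tasks

// Synchronous: by the time this returns, the work is done.
let doWorkSync (flag: bool ref) : unit =
    Thread.Sleep 100 // hypothetical work
    flag.Value <- true

// Asynchronous: returns a Task right away; the work finishes after the call has returned.
let doWorkAsync (flag: bool ref) : Task =
    let work () =
        Thread.Sleep 100 // the same hypothetical work
        flag.Value <- true
    Task.Run(fun () -> work ())
```

The caller of `doWorkAsync` sees the flag still false right after the call and true only once the returned task has completed.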

You are confusing Synchronous with Parallel vs. Series. Synchronous means all at the same time. Synchronized means related to each other, which can mean in series or at a fixed interval. While the program is doing it all, it is running in series. Get a dictionary... this is why we have unsweet tea: you have tea or sweetened tea.

A different English definition of synchronize is here:
Coordinate; combine.
I think that is a better definition than "happening at the same time". That one is also a definition, but I don't think it is the one that fits the way the word is used in computer science.
So an asynchronous task is not coordinated with other tasks, whereas a synchronous task IS coordinated with other tasks, so one finishes before another starts.
How that is achieved is a different question.

I think a good way to think of it is a classic relay race.
Synchronous: processes are like members of the same team; they won't execute until they receive the baton (the end of the previous process's/runner's execution), and yet they are all acting in sync with each other.
Asynchronous: processes are like members of different teams on the same relay track; they run and stop asynchronously with respect to each other, but within the same race (the overall program execution).
Does it make sense?

Synchronous means queued execution: tasks are executed one by one. Suppose there is only one vehicle that has to be shared among friends to reach their destinations; the vehicle is used by one friend at a time.
In the asynchronous case, each friend can get a rented vehicle and reach their destination.

In regards to the "at the same time" definition of synchronous execution (which is sometimes confusing), here's a good way to understand it:
Synchronous Execution: All tasks within a block of code are all executed at the same time.
Asynchronous Execution: All tasks within a block of code are not all executed at the same time.

Yes, synchronous literally means "at the same time": doing work all together. Multiple humans/objects in the world can do multiple things at the same time, but in computing, synchronous means that the processes work together: they depend on each other's results, and that's why they are executed one after another in proper sequence. Asynchronous, by contrast, means that the processes don't work together; they may run at the same time (if they are on multiple threads), but they work independently.
