Will you provide a 'crayola' explanation of the OpenCL function atomic_inc? - opencl

If I had all of the information needed to answer this question I probably wouldn't need to ask it, but at least I have the creating company's specification link.
https://www.khronos.org/registry/OpenCL/sdk/1.1/docs/man/xhtml/atomic_inc.html
I've read quite a few definitions but am seeking a basic level explanation of what this function does.

It atomically increments a 32 bit value and stores it in the same location in memory. You’d want to atomically increment if you have multiple threads reading and writing to the same location. This makes sure that only one thread will increment at a time.
croninsiglos. Reddit. ELI5: What does the OpenCl function atomic_inc do ? asked by John Lutz on 22 Jun 2020.

Related

Trigger a function after 5 API calls return (in a distributed context)

My girlfriend was asked the below question in an interview:
We trigger 5 independent APIs simultaneously. Once they have all completed, we want to trigger a function. How will you design a system to do this?
My girlfriend replied she will use a flag variable, but the interviewer was evidently not happy with it.
So, is there a good way in which this could be handled (in a distributed context)? Note that each of the 5 API calls are made by different servers and the function to be triggered is on a 6th server.
The other answers suggesting Promises seem to assume all these requests necessarily come from the same client. If the context here is distributed systems, as you said it is, then I don't think those are valid answers. If they were, then the interview question would have nothing to do with distributed systems, except to essay your girlfriend's ability to recognize something that isn't really a distributed systems problem.
And the question does have the shape of some classic problems in distributed systems. It sounds a lot like YouTube view counting: How do you achieve qualities like atomicity and consistency in a multi-threaded, multi-process, or multi-client environment? Failing to recognize this, thinking the answer could be as simple as "a flag", betrayed a lack of experience in distributed systems.
Another thing about that answer is that it leaves many ambiguities. Where does the flag live? As a variable in another (Java?) API? In a database? In a file? Even in a non-distributed context, these are important questions. And if she had gone on to address these questions, even being innocent of all the distributed systems complications, she might have happily fallen into a discussion of the kinds of D.S. problems that occur when you use, say, a file; and how using a ACID-compliant database might solve those problems, and what the tradeoffs might be there... And she might have corrected herself and said "counter" instead of "flag"!
If I were asked this, my first thought would be to use promises/futures. The idea behind them is that you can execute time-consuming operations asynchronously and they will somehow notify you when they've completed, either successfully or unsuccessfully, typically by calling a callback function. So the first step is to spawn five asynchronous tasks and get five promises.
Then I would join the five promises together, creating a unified promise that represents the five separate tasks. In JavaScript I might call Promise.all(); in Java I would use CompletableFuture.allOf().
I would want to make sure to handle both success and failure. The combined promise should succeed if all of the API calls succeed and fail if any of them fail. If any fail there should be appropriate error handling/reporting. What happens if multiple calls fail? How would a mix of successes and failures be reported? These would be design points to mention, though not necessarily solve during the interview.
Promises and futures typically have modular layering system that would allow edge cases like timeouts to be handled by chaining handlers together. If done right, timeouts could become just another error condition that would be naturally handled by the error handling already in place.
This solution would not require any state to be shared across threads, so I would not have to worry about mutexes or deadlocks or other thread synchronization problems.
She said she would use a flag variable to keep track of the number of API calls have returned.
One thing that makes great interviewees stand out is their ability to anticipate follow-up questions and explain details before they are asked. The best answers are fully fleshed out. They demonstrate that one has thought through one's answer in detail, and they have minimal handwaving.
When I read the above I have a slew of follow-up questions:
How will she know when each API call has returned? Is she waiting for a function call to return, a callback to be called, an event to be fired, or a promise to complete?
How is she causing all of the API calls to be executed concurrently? Is there multithreading, a fork-join pool, multiprocessing, or asynchronous execution?
Flag variables are booleans. Is she really using a flag, or does she mean a counter?
What is the variable tracking and what code is updating it?
What is monitoring the variable, what condition is it checking, and what's it doing when the condition is reached?
If using multithreading, how is she handling synchronization?
How will she handle edge cases such API calls failing, or timing out?
A flag variable might lead to a workable solution or it might lead nowhere. The only way an interviewer will know which it is is if she thinks about and proactively discusses these various questions. Otherwise, the interviewer will have to pepper her with follow-up questions, and will likely lower their evaluation of her.
When I interview people, my mental grades are something like:
S — Solution works and they addressed all issues without prompting.
A — Solution works, follow-up questions answered satisfactorily.
B — Solution works, explained well, but there's a better solution that more experienced devs would find.
C — What they said is okay, but their depth of knowledge is lacking.
F — Their answer is flat out incorrect, or getting them to explain their answer was like pulling teeth.

Could you implement async-await by memcopying stack frames rather than creating state machines?

I am trying to understand all the low-level stuff Compilers / Interpreters / the Kernel do for you (because I'm yet another person who thinks they could design a language that's better than most others)
One of the many things that sparked my curiosity is Async-Await.
I've checked the under-the-hood implementation for a couple languages, including C# (the compiler generates the state machine from sugar code) and Rust (where the state machine has to be implemented manually from the Future trait), and they all implement Async-Await using state machines.
I've not found anything useful by googling ("async copy stack frame" and variations) or in the "Similar questions" section.
To me, this method seems rather complicated and overhead-heavy;
Could you not implement Async-Await by simply memcopying the stack frames of async calls to/from heap?
I'm aware that it is architecturally impossible for some languages (I thank the CLR can't do it, so C# can't either).
Am I missing something that makes this logically impossible? I would expect less complicated code and a performance boost from doing it that way, am I mistaken? I suppose when you have a deep stack hierarchy after a async call (eg. a recursive async function) the amount of data you would have to memcopy is rather large, but there are probably ways to work around that.
If this is possible, then why isn't it done anywhere?
Yes, an alternative to converting code into state machines is copying stacks around. This is the way that the go language does it now, and the way that Java will do it when Project Loom is released.
It's not an easy thing to do for real-world languages.
It doesn't work for C and C++, for example, because those languages let you make pointers to things on the stack. Those pointers can be used by other threads, so you can't move the stack away, and even if you could, you would have to copy it back into exactly the same place.
For the same reason, it doesn't work when your program calls out to the OS or native code and gets called back in the same thread, because there's a portion of the stack you don't control. In Java, project Loom's 'virtual threads' will not release the thread as long as there's native code on the stack.
Even in situations where you can move the stack, it requires dedicated support in the runtime environment. The stack can't just be copied into a byte array. It has to be copied off in a representation that allows the garbage collector to recognize all the pointers in it. If C# were to adopt this technique, for example, it would require significant extensions to the common language runtime, whereas implementing state machines can be accomplished entirely within the C# compiler.
I would first like to begin by saying that this answer is only meant to serve as a starting point to go in the actual direction of your exploration. This includes various pointers and building up on the work of various other authors
I've checked the under-the-hood implementation for a couple languages, including C# (the compiler generates the state machine from sugar code) and Rust (where the state machine has to be implemented manually from the Future trait), and they all implement Async-Await using state machines
You understood correctly that the Async/Await implementation for C# and Rust use state machines. Let us understand now as to why are those implementations chosen.
To put the general structure of stack frames in very simple terms, whatever we put inside a stack frame are temporary allocations which are not going to outlive the method which resulted in the addition of that stack frame (including, but not limited to local variables). It also contains the information of the continuation, ie. the address of the code that needs to be executed next (in other words, the control has to return to), within the context of the recently called method. If this is a case of synchronous execution, the methods are executed one after the other. In other words, the caller method is suspended until the called method finishes execution. This, from a stack perspective fits in intuitively. If we are done with the execution of a called method, the control is returned to the caller and the stack frame can be popped off. It is also cheap and efficient from a perspective of the hardware that is running this code as well (hardware is optimised for programming with stacks).
In the case of asynchronous code, the continuation of a method might have to trigger several other methods that might get called from within the continuation of callers. Take a look at this answer, where Eric Lippert outlines the entirety of how the stack works for an asynchronous flow. The problem with asynchronous flow is that, the method calls do not exactly form a stack and trying to handle them like pure stacks may get extremely complicated. As Eric says in the answer, that is why C# uses graph of heap-allocated tasks and delegates that represents a workflow.
However, if you consider languages like Go, the asynchrony is handled in a different way altogether. We have something called Goroutines and here is no need for await statements in Go. Each of these Goroutines are started on their own threads that are lightweight (each of them have their own stacks, which defaults to 8KB in size) and the synchronization between each of them is achieved through communication through channels. These lightweight threads are capable of waiting asynchronously for any read operation to be performed on the channel and suspend themselves. The earlier implementation in Go is done using the SplitStacks technique. This implementation had its own problems as listed out here and replaced by Contigious Stacks. The article also talks about the newer implementation.
One important thing to note here is that it is not just the complexity involved in handling the continuation between the tasks that contribute to the approach chosen to implement Async/Await, there are other factors like Garbage Collection that play a role. GC process should be as performant as possible. If we move stacks around, GC becomes inefficient because accessing an object then would require thread synchronization.
Could you not implement Async-Await by simply memcopying the stack frames of async calls to/from heap?
In short, you can. As this answer states here, Chicken Scheme uses a something similar to what you are exploring. It begins by allocating everything on the stack and move the stack values to heap when it becomes too large for the GC activities (Chicken Scheme uses Generational GC). However, there are certain caveats with this kind of implementation. Take a look at this FAQ of Chicken Scheme. There is also lot of academic research in this area (linked in the answer referred to in the beginning of the paragraph, which I shall summarise under further readings) that you may want to look at.
Further Reading
Continuation Passing Style
call-with-current-continuation
The classic SICP book
This answer (contains few links to academic research in this area)
TLDR
The decision of which approach to be taken is subjective to factors that affect the overall usability and performance of the language. State Machines are not the only way to implement the Async/Await functionality as done in C# and Rust. Few languages like Go implement a Contigious Stack approach coordinated over channels for asynchronous operations. Chicken Scheme allocates everything on the stack and moves the recent stack value to heap in case it becomes heavy for its GC algorithm's performance. Moving stacks around has its own set of implications that affect garbage collection negatively. Going through the research done in this space will help you understand the advancements and rationale behind each of the approaches. At the same time, you should also give a thought to how you are planning on designing/implementing the other parts of your language for it be anywhere close to be usable in terms of performance and overall usability.
PS: Given the length of this answer, will be happy to correct any inconsistencies that may have crept in.
I have been looking into various strategies for doing this myseøf, because I naturally thi k I can design a language better than anybody else - same as you. I just want to emphasize that when I say better, I actually mean better as in tastes better for my liking, and not objectively better.
I have come to a few different approaches, and to summarize: It really depends on many other design choices you have made in the language.
It is all about compromises; each approach has advantages and disadvantages.
It feels like the compiler design community are still very focused on garbage collection and minimizing memory waste, and perhaps there is room for some innovation for more lazy and less purist language designers given the vast resources available to modern computers?
How about not having a call stack at all?
It is possible to implement a language without using a call stack.
Pass continuations. The function currently running is responsible for keeping and resuming the state of the caller. Async/await and generators come naturally.
Preallocated static memory addresses for all local variables in all declared functions in the entire program. This approach causes other problems, of course.
If this is your design, then asymc functions seem trivial
Tree shaped stack
With a tree shaped stack, you can keep all stack frames until the function is completely done. It does not matter if you allow progress on any ancestor stack frame, as long as you let the async frame live on until it is no longer needed.
Linear stack
How about serializing the function state? It seems like a variant of continuations.
Independent stack frames on the heap
Simply treat invocations like you treat other pointers to any value on the heap.
All of the above are trivialized approaches, but one thing they have in common related to your question:
Just find a way to store any locals needed to resume the function. And don't forget to store the program counter in the stack frame as well.

why use pointers

I'm learning C++ on my own. I'm an EE and learned it about 20 years ago, but in the progress of my career I stopped programming and didn't take it up again until recently. I should add that I never took any classes in programming.
I have a theoretical question about pointers. In reading the books about pointers it seems they have an important role in C++. My problem is that I can't see what that is. I see that pointers have a role in arrays, but I can't see their role in anything else.
I can see what they do, but I don't see why use pointers in the situations I see them in. Either references or straight variables would work just as well. I have a feeling the answer lies in the area of memory ( it's optimal use), but I just don't know.
Any answers would be appreciated. Thanks.
Consider the following from cplusplus.com:
"[T]here may be cases where the memory needs of a program can only be
determined during runtime. For example, when the memory needed depends
on user input. On these cases, programs need to dynamically allocate
memory, for which the C++ language integrates the operators new and
delete."
If you could determine all your memory needs prior to run time and did not need to make use of any abstract data type like a linked list, then yes, it would be difficult to see their use. However, what if you want to store values in an array, but you don't yet know how big that array will need to be?
Another value of pointers arises when you consider passing values from function to function. You may find this thread of value regarding the differences between pointers and references in C++ and how/why to use each.
We have been having several pedagogical conversations focused on pointers on the CSEducators.SE site. I'd encourage you to read those as well:
Simple Pointer Examples in C
Lesson Idea: Arrays, Pointers, and Syntactic Sugar
Pointers come from C, which had no concept of reference, and which C++ inherited from.
Everything that can be done with a reference in C++ is done with a pointer in C.
I find this question really great because it is pure.
A programming language is considered "safe" when the programs written in it can only call functions and access data that the program can name.
Now, the concept of pointer was invented to break this sandbox of safety and provide developer with freedom to think and act outside of the box.
Think of pointers as poor man's tool to achieve something not provided by the programming language itself.
It is misleading to think you could achieve higher performance if programmed some algorithm using pointers. Optimization is privilege of the compiler and hardware, not human.

Write fast Common Lisp code [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 7 years ago.
Improve this question
I'm not sure, if some weird things make my Code faster:
Is it normally better to use inbuilt operations or write new specialized functions, that do the same thing?
(for example a version of #'map only for vectors; my version is often faster without type declarations)
Should I define new (complicated) types to use them in declarations?
(for example a typed list)
Should I define slots directly to an object? (for example px and py for a 2-dimensional object, or is it ok to use one slot pos of type vector, that I could reuse it for other purposes)
There are a few parts to this but here is a quick braindump
PROFILE!
Use a distribution of CL that has a profiler built in, I use sbcl for example http://www.sbcl.org/1.0/manual/Statistical-Profiler.html
The nice thing about the sbcl profiler is that once you have profiled a function, if you disassemble it, the machine code is annotated with statistics. This requires some knowledge of the target machine code.
Do not underestimate your implementation: They can have advanced type and flow analysis built in and are able to, for example, pick a vector only version of map when it makes sense.
Learn compiler macros: compiler macros can shadow functions this gives you a place to put extra optimizations based on the context of the form. IT does this without replacing the function so it can still be used in a higher order way.
Learn Type declarations
I found this series of blog posts helped me understand this technique http://nklein.com/tags/optimization/page/2/ Read em all!
ONE MASSIVE NOTE: Don't ever lie to your compiler about a type. Type declarations are a way of telling your compiler you know what the type is the compiler doesn't even have to use them, and when it does it doesn't have to check you are giving it the correct thing.
Unboxed data
Some implementations are able to unbox certain datatype in certain conditions. Sorry that is vague but you will need to read up for your implementation. For sbcl the 'sbcl internals' guide is very helpful.
For example:
(make-array 100 :element-type 'single-float :initial-element 0.0)
Can be stored as a contiguous block of memory in sbcl.
PROFILE AGAIN (With realistic data)
I spent 3 hours writing a crazy compiler macro based n dimensional matrix multiplication routine and then tested it against a 1 line built in solution. For matricies below 5 dimensions there was not a big difference! For higher dimensions, yeah It rocked but that 'performance benefit' is purely academic because those code paths were never touched. Luckily I undertook the task for fun as I was asking the same question you are now.
Algorithms
All the type specifiers in the world won't give you a 100times performance increase. This comes from better techniques. Read on the maths behind the problem, implements different helper functions that have different strengths and choose between them at runtime...then go back and use compiler macros to allow lisp to choose at compile time. OR specify the technique as a higher order functions, for example make-hash-table allows you to specify the hashing function and rehash sizes, this can be crucial in getting good performance at certain sizes.
Know the limits of BigO
Algorithmic complexity means nothing if you loose all the of performance due to memory locality issues. Conversly sometime we can achieve superlinear performance characteristics if, by spliting the problem among cores, the reduced dataset now fits in the l2 cache.
BigO is a great metric but it isn't the end of the story. This is the reason assoc lists are a totally valid alternative to hash-tables for low numbers of keys and certain access profiles.
Summary
There is a golden mantra I heard from somewhere in the lisp community that works so well:
Make it Fast and then make it Fast
If nothing else follow this. Chant it to yourself!
Get the program up and running quickly, in doing so you are more likely to spot the places where you can use a better technique or algorithm to get your several-orders-of-magnitude improvement. Do use CL's own functions first. Don't trade lisp's higher order nature too early by using macros, explore how far you can go with functions.
[Edit] More notes - the following is for sbcl
Type definitions on struct slots are used for optimizing, type declarations for class slots are not.
With regard to types, start with what makes the program easy to write and understand (Make it fast) and then look into access times if it is the bottleneck (make It Fast!)
(slot-value x 'name) is very fast when name is known. Look at how with-slots uses symbol-macrolet to it's advantage
So to kinda directly answer your original question:
built in first (also check libraries)
does it make the problem easier to write and understand?
use pos. By the time the performance of that indirection becomes and issue you will have found a dozen other ways to speed up the problem and the solution will be part of a wider optimization technique.

What measurement do you use in your development process to determine *Doneness* of your software? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 9 years ago.
Improve this question
I just finished listening to a very eye-opening podcast on Hanselminutes about the definition of "Done". So my question to everyone is "When do you consider a piece of software to be "Done"? Is it when it's fully unit tested? Is it when it's completely documented? What measurement do you use in your development process to determine Doneness of your software?
When the check clears?
Seriously, every time you write a piece of software, you should have defined what "done" means. First. If you have a customer, then there should be a contract -- specific, measurable, agreed, and testable -- that defines done.
If you don't know where you're going, how will you know when you get there?
Surely dependent on context and purpose of the software?
Lunar Lander (the real thing) would have a very different definition of Done to Lunar Lander the Flash game.
Where I work, DONE is defined by a committee of non-technical managers. You can imagine the fun and games.
Test, unit test, integration test, webtest, peer QA and end user review in the sprint review. Peer QA decides if anything else is necessary, all tests must pass in CI environment. This is in a scrum web-project.
When they client(1) considers it done, it's checked in, backed up, and documented.
Also: "done" rarely exists in web dev.
(1) where client may be an internal PM or such
A good measurement is code churn. Using your source code control software, measure the rate of change. How many lines of code are being removed/added/changed per day. Graph this over time. As you approach being ready to release, this should trend downwards and give an indication of stability and readiness to ship. This assumes that you are actually testing well and making changes to fix bugs or respond to change requests. If your user acceptance test users and integration/unit test activity are continuing to regress and test and you aren't having to make code changes (because they aren't finding anything necessitating a change) then you are probably ready to ship.
If big chunks of code are churning a few days before an arbitrary or externally driven ship date, look out!
When the software can be used to satisfy the requirements that define the system.
But I've always thought, "software is never done, it just reaches an acceptable level of incompleteness."
From a development viewpoint 'done' is described quite well by my friend and mentor Simon Baker, here
Alistair Cockburn, Jeff Patton and Mike Cohn also have the following collected views
Shippable quality, which has to be exercised in a go-live, forces teams to really focus on ensuring that incremental work is more carefully thought through.
'Done' is something which all the above quoted would be the first to agree is always different per team and project; however to satisfy knowing that a given piece of work is done, the team must conduct an exercise at the start to flesh out the measure of done-ness and list those criteria.
In so doing, everyone has agreed by consensus what an acceptable completion point is - whether that includes noting the Task in Excel, or writing documentation (or not) becomes an implementation detail for that team/project. The overriding thing is that everyone's understanding of Done is uniform.
Equally, assuming you reach that definition by consensus, it can also be changed as required by consensus.
When all of the requirements are met and all the tests pass.
It's never done, simply versioned and released.
Each project will have it's own definition of done, ours is code complete (compiles successfully, etc), unit tested (or some kind of local testing if not possible) and released within one of our packages (so it's available to the other teams).
But the MOST important thing in DoD is every parties should agree on what it is (team, product owner, manager, etc) and it should be some kind of public contract, published in a team portal is a good idea.
Any piece of software at any time is always 80% done. At least, that's what my experience teaches ...
When the customer thinks it is.

Resources