I know GC wasn't popular in the days when Ada was developed and for the main use case of embedded programming it is still not a good choice.
But considering that Ada is a general purpose programming language why wasn't a partial and optional (traces only explicitly tagged memory objects) garbage collector introduced in later revisions of the language and the compiler implementations.
I simply can't think of developing a normal desktop application without a garbage collector anymore.
Ada was designed with military applications in mind. One of the big priorities in its design was determinism. i.e. one wanted an Ada program to consistently perform exactly the same way every time, in any environment, under all operating systems... that kinda thing.
A garbage collector turns one application into two, working against one another. Java programs develop hiccups at random intervals when the GC decides to go to work, and if it's too slow about it there's a chance that an application will run out of heap sometimes and not others.
Simplified: A garbage collector introduces some variability into a program that the designers didn't want. You make a mess - you clean it up! Same code, same behavior every time.
Not that Ada became a raging worldwide success, mind you.
Because Ada was designed for use in defense systems which control weapons in realtime, and garbage collection interferes with the timing of your application. This is dangerous which is why, for many years, Java came with a warning that it was not to be used for healthcare and military control systems.
I believe that the reason there is no longer such a disclaimer with Java is because the underlying hardware has become much faster as well as the fact that Java has better GC algorithms and better control over GC.
Remember that Ada was developed in the 1970's and 1980's at a time when computers were far less powerful than they are today, and in control applications timing issues were paramount.
First off, there is nothing in the language really that prohibits garbage collection.
Secondly some implementations do perform garbage collection. In particular, all the implementations that target the JVM garbage collect.
Thirdly, there is a way to get some amount of garbage collection with all compilers. You see, when an access type goes out of scope, if you specifially told the language to set aside a certian amount of space for storage of its objects, then that space will be destroyed at that point. I've used this in the past to get some modicum of garbage collection. The declaration voodo you use is:
type Foo is access Blah;
for Foo'storage_size use 100_000_000; --// 100K
If you do this, then all (100K of) memory allocated to Blah objects pointed to by Foo pointers will be cleaned up when the Foo type goes out of scope. Since Ada allows you to nest subroutines inside of other subroutines, this is particularly powerful.
To see more about what storage_size and storage pools can do for you, see LRM 13.11
Fourthly, well-written Ada programs don't tend to rely on dynamic memory allocation nearly as much as C programs do. C had a number of design holes that practicioners learned to use pointers to paint over. A lot of those idioms aren't nessecary in Ada.
the answer is more complicated: Ada does not require a garbage collector, because of real-time constraints and such. however, the language have been cleverly designed so as to allow the implementation of a garbage collector.
although, many (almost all) compilers do not include a garbage collector, there are some notable implementation:
a patch for GNAT
Ada compilers targeting the Java Virtual Machine (i don't know if those projects are still supported). It used the garbage collector of the JVM.
there are plenty other sources about garbage collection in Ada around the web. this subject has been discussed at length, mainly because of the fierce competition with Java in the mid '90s (have a look at this page: "Ada 95 is what the Java language should have been"), when Java was "The Next Big Thing" before Microsoft drew C#.
First off, I'd like to know who's using Ada these days. I actually like the language, and there's even a GUI library for Linux/Ada, but I haven't heard anything about active Ada development for years. Thanks to its military connections, I'm really not sure if it's ancient history or so wildly successful that all mention of its use is classified.
I think there's a couple of reason for no GC in Ada. First, and foremost, it dates back to an era where most compiled languages used primarily stack or static memory, or in a few cases, explicit heap allocate/free. GC as a general philosophy really only took off about 1990 or so, when OOP, improved memory management algorithms and processors powerful enough to spare the cycles to run it all came into their own. What simply compiling Ada could do to an IBM 4331 mainframe in 1989 was simply merciless. Now I have a cell phone that can outperform that machine's CPU.
Another good reason is that there are people who think that rigorous program design includes precise control over memory resources, and that there shouldn't be any tolerance for letting dynamically-acquired objects float. Sadly, far too many people ended up leaking memory as dynamic memory became more and more the rule. Plus, like the "efficiency" of assembly language over high-level languages, and the "efficiency" of raw JDBC over ORM systems, the "efficiency" of manual memory management tends to invert as it scales up (I've seen ORM benchmarks where the JDBC equivalent was only half as efficient). Counter-intuitive, I know, but these days systems are much better at globally optimizing large applications, plus they're able to make radical re-optimizations in response to superficially minor changes.Including dynamically re-balancing algorithms on the fly based on detected load.
I'm afraid I'm going to have to differ with those who say that real-time systems can't afford GC memory. GC is no longer something that freezes the whole system every couple of minutes. We have much more intelligent ways to reclaim memory these days.
Your question is incorrect. It does. See the package ada.finalization which handles GC for you.
I thought I'd share a really simple example of how to implement a Free() procedure (which would be used in a way familiar to all C programmers)...
with Ada.Integer_Text_IO, Ada.Unchecked_Deallocation;
use Ada.Integer_Text_IO;
procedure Leak is
type Int_Ptr is access Integer;
procedure Free is new Ada.Unchecked_Deallocation (Integer, Int_Ptr);
Ptr : Int_Ptr := null;
begin
Ptr := new Integer'(123);
Free (Ptr);
end Leak;
Calling Free at the end of the program will return the allocated Integer to the Storage Pool ("heap" in C parlance). You can use valgrind to demonstrate that this does in fact prevent 4 bytes of memory being leaked.
The Ada.Unchecked_Deallocation (a generically defined procedure) can be used on (I think) any type that may be allocated using the "new" keyword. The Ada Reference Manual ("13.11.2 Unchecked Storage Deallocation") has more details.
Related
I tried googling and read some snippets online. Why is Ada a "safety critical" language? Some things I notice are
No pointers
Specify a range (this type is an integer but can only be 1-12)
Explicitly state if a function parameter is out or in/out
Range-based loops (To avoid bound errors or bound checking)
The rest of the syntax I either didn't understand or didn't see how it helps it to be 'safety critical'. These are some points but I don't see the big picture. Does it have design-by-contract that I am not seeing? Does it have rules that make it harder for code to compile (and if so what are some?) Why is it a 'safety critical' language?
Well, this is pretty simple. The reason there's a lot of Ada sytax that doesn't seem to have much of anything to do with making the language "safety-critical" (whatever that means for a language) is that this was not Ada's design goal. It was designed to be a general-purpose compiled system's programming language, sufficently capable that the U.S. Department of Defense could get rid of all its little one-use languages it had to support all over the place.
The fact that the end result is a language that is rather useful for safety-critical applications was just a happy side-effect of the fact that the language was very well-designed with military applications (where lives are often staked on the software's reliability) in mind.
Surprisingly few other modern languages had support for building reliable software as a design goal. Most seem to be cooked up by a lone "genius" hacker, with the chief goal being the ability to facilitate cranking out lots of code quickly, perhaps in some new way that hacker favors.
All of those are good for safety-critical application; but consider also the ability to assign a layout (down to the bits) and the ability to [optionally] specify that such a record can ONLY be at a certain location (useful for things like video-memory mappings).
Consider that a lot of safety-critical applications are also without standard (in the senses both of "wide-spread" and of forward-comparability) interfaces; example: nuclear reactors, rocket engines (the engineering itself differs from generation to generation*), models-of-aircraft.
The upcoming Ada 2012 standard DOES have actual contracts, in the form of pre- and post-conditions; example (taken from http://www2.adacore.com/wp-content/uploads/2006/03/Ada2012_Rational_Introducion.pdf):
generic
type Item is private;
package Stacks is
type Stack is private;
function Is_Empty(S: Stack) return Boolean;
function Is_Full(S: Stack) return Boolean;
procedure Push(S: in out Stack; X: in Item)
with
Pre => not Is_Full(S),
Post => not Is_Empty(S);
procedure Pop(S: in out Stack; X: out Item)
with
Pre => not Is_Empty(S),
Post => not Is_Full(S);
Stack_Error: exception;
private
-- Private portion.
end Stacks;
Also, another thing that gets glossed over, is the ability to exclude Null from your Access/pointer types; this is useful in that you can a) specify that exclusion in your subprogram parameters, and b) to streamline your algorithm [since you don't have to check for null at every instance-of-use], and c) letting your exception handler handle the (I assume) exceptional circumstance of a Null.
* The Arianne 5 disaster occurred precisely because the management disregarded this fact and had the programmers use the incorrect specifications: that of the Arianne 4.
AdaCore has a nice presentation of various safety features of Ada 2005 here:
http://www.adacore.com/knowledge/technical-papers/safe-secure/
The US Government and industry also ran studies on program reliability a long time ago that compared languages. I couldn't find one very quickly as the sites are all old (!) but here's a quote from DDCI's website you: " In studies conducted during the eighties Ada consistently outperformed established programming languages like Pascal, Fortran, and C. In the nineties, Ada continues to surpass C++ in performance evaluations measuring capability, efficiency, maintenance, risk, and lifecycle cost."
Lists reasons they used it on Commanche project in link below. I'll add that the platform implementations have been around for a LONG time and stayed stable. Like a source in article said, maintenance is where majority of costs come in. We've seen the modern contenders .NET and Java change like crazy. Long-term stability of Ada is better for safety-critical apps which are often fielded for long periods (sometimes decades).
http://www.ddci.com/displayNews.php?fn=programs_rah66.php
Another benefit is Ada was designed for cross-language development. I keep seeing in the news people talking about how .NET and JVM are innovative b/c they let you mix the "right tools for the job" into one system. Ada's had that ability for a long time. It's common for apps to be a mix of Ada, C, C++, assembler, etc. (MULTOS CA comes to mind.) And they still function well.
It hasn't been static either. They keep updating the language, most recently in 2012. Its portability allowed it to run on both JVM and .NET too for people that want the libraries or have plenty existing code on those. There's also Ada development tools and robust runtimes for many OS's and RTOS's from IBM, Aonix, AdaCore, and Green Hills.
Last benefit: if it will compile, it will work. Usually.
I know this is late, but it's an example I recently encountered. I wrote the following C++ function that worked fine in -O0:
size_t get_index(const Monomial & t, const Monomial & u) {
get_index(t, u.log()); // forgot to type "return" first...
}
This actually compiles, and while it might emit a warning if you're lucky to have a decent compiler, you're not likely to see it when it's one of a lot of programs being compiled. Miraculously, it ran just fine when I compiled it with -O0. But when I compiled it in -O3 it crashed every time, and for the life of me I couldn't figure out why, because I didn't see the warning (if it even appeared). Imagine debugging that when you think you imagine a return there simply because you know your intent.
Likewise, back when I was learning C I frequently made this mistake:
int a;
scanf("%d", a); /* left out & before the a */
Using int's for pointers and vice versa is considered normal programming practice in C, so much so that compilers 25 years ago didn't even bother to emit a warning. Heck, that was a feature of C, not a bug. (See, for instance, Brian Kernaghan's "Why Pascal is not my favorite Language.") And of course back then home computer OS's didn't have memory protection; if you were lucky the computer wasn't writing to hard disk when it reset.
These kinds of mistakes won't even compile in Ada. Functions have to return a value, and you cannot accidentally use an Integer in place of an access Integer (i.e., pointer to integer).
And that's just the start!
Ada has superior support for real-time http://www.adaic.org/resources/add_content/standards/05rat/html/Rat-1-3-4.html . This allows the programmer to implement a greater degree of determinism rather than worry about the technical details of the programming language itself. While many of the run-time features supported by Ada can be achieved in C with some deep understanding of things, Ada achieves most real-time features very well since it's standardized. Ada even has a Ravenscar profile http://en.wikipedia.org/wiki/Ravenscar_profile and SPARK a computer language where "highly reliable operation is essential" is based on a subset of Ada 83 and 95 http://en.wikipedia.org/wiki/SPARK_%28programming_language%29. My guess is that SPARK does not have a version for later Ada versions b/c it is too early to tell how safe the newer versions really are. It's also mentioned in the latter article that Ada can be optimized for speeds rivaling C which would be important for real-time applications that relied on precise control during rapidly changing events. There are many built in standard features for real-time control which are obviously important for a 'safety critical' language.
I was wondering if it is possible to create a programming language without explicit memory allocation/deallocation (like C, C++ ...) AND without garbage collection (like Java, C#...) by doing a full analysis at the end of each scope?
The obvious problem is that this would take some time at the end of each scope, but I was wondering if it has become feasible with all the processing power and multiple cores in current CPU's. Do such languages exist already?
I also was wondering if a variant of C++ where smart pointers are the only pointers that can be used, would be exactly such a language (or am I missing some problems with that?).
Edit:
Well after some more research apparently it's this: http://en.wikipedia.org/wiki/Reference_counting
I was wondering why this isn't more popular. The disadvantages listed there don't seem quite serious, the overhead should be that large according to me. A (non-interpreted, properly written from the ground up) language with C family syntax with reference counting seems like a good idea to me.
The biggest problem with reference counting is that it is not a complete solution and is not capable of collecting a cyclic structure. The overhead is incurred every time you set a reference; for many kinds of problems this adds up quickly and can be worse than just waiting for a GC later. (Modern GC is quite advanced and awesome - don't count it down like that!!!)
What you are talking about is nothing special, and it shows up all the time. The C or C++ variant you are looking for is just plain regular C or C++.
For example write your program normally, but constrain yourself not to use any dynamic memory allocation (no new, delete, malloc, or free, or any of their friends, and make sure your libraries do the same), then you have that kind of system. You figure out in advance how much memory you need for everything you could do, and declare that memory statically (either function level static variables, or global variables). The compiler takes care of all the accounting the normal way, nothing special happens at the end of each scope, and no extra computation is necessary.
You can even configure your runtime environment to have a statically allocated stack space (this one isn't really under the compiler's control, more linker and operating system environment). Just figure out how deep your function call chain goes, and how much memory it uses (with a profiler or similar tool), an set it in your link options.
Without dynamic memory allocation (and thus no deallocation through either garbage collection or explicit management), you are limited to the memory you declared when you wrote the program. But that's ok, many programs don't need dynamic memory, and are already written that way. The real need for this shows up in embedded and real-time systems when you absolutely, positively need to know exactly how long an operation will take, how much memory (and other resources) it will use, and that the running time and the use of those resources can't ever change.
The great thing about C and C++ is that the language requires so little from the environment, and gives you the tools to do so much, that smart pointers or statically allocated memory, or even some special scheme that you dream up can be implemented. Requiring the use them, and the constraints you put on yourself just becomes a policy decision. You can enforce that policy with code auditing (use scripts to scan the source or object files and don't permit linking to the dynamic memory libraries)
I'm studying multicore parallelism in F#. I have to admit that immutability really helps to write correct parallel implementation. However, it's hard to achieve good speedup and good scalability when the number of cores grows. For example, my experience with Quick Sort algorithm is that many attempts to implement parallel Quick Sort in a purely functional way and using List or Array as the representation are failed. Profiling those implementations shows that the number of cache misses increases significantly compared to those of sequential versions. However, if one implements parallel Quick Sort using mutation inside arrays, a good speedup could be obtained. Therefore, I think mutation might be a good practice for optimizing multicore parallelism.
I believe that cache locality is a big obstacle for multicore parallelism in a functional language. Functional programming involves in creating many short-lived objects; destruction of those objects may destroy coherence property of CPU caches. I have seen many suggestions how to improve cache locality in imperative languages, for example, here and here. But it's not clear to me how they would be done in functional programming, especially with recursive data structures such as trees, etc, which appear quite often.
Are there any techniques to improve cache locality in an impure functional language (specifically F#)? Any advices or code examples are more than welcome.
As far as I can make out, the key to cache locality (multithreaded or otherwise) is
Keep work units in a contiguous block of RAM that will fit into the cache
To this end ;
Avoid objects where possible
Objects are allocated on the heap, and might be sprayed all over the place, depending on heap fragmentation, etc.
You have essentially zero control over the memory placement of objects, to the extent that the GC might move them at any time.
Use arrays. Arrays are interpreted by most compilers as a contiguous block of memory.
Other collection datatypes might distribute things all over the place - linked lists, for example, are composed of pointers.
Use arrays of primitive types. Object types are allocated on the heap, so an array of objects is just an array of pointers to objects that may be distributed all over the heap.
Use arrays of structs, if you can't use primitives. Structs have their fields arranged sequentially in memory, and are treated as primitives by the .NET compilers.
Work out the size of the cache on the machine you'll be executing it on
CPUs have different size L2 caches
It might be prudent to design your code to scale with different cache sizes
Or more simply, write code that will fit inside the lowest common cache size your code will be running on
Work out what needs to sit close to each datum
In practice, you're not going to fit your whole working set into the L2 cache
Examine (or redesign) your algorithms so that the data structures you are using hold data that's needed "next" close to data that was previously needed.
In practice this means that you may end up using data structures that are not theoretically perfect examples of computer science - but that's all right, computers aren't theoretically perfect examples of computer science either.
A good academic paper on the subject is Cache-Efficient String Sorting Using Copying
Allowing mutability within functions in F# is a blessing, but it should only be used when optimizing code. Purely-functional style often yields more intuitive implementation, and hence is preferred.
Here's what a quick search returned: Parallel Quicksort in Haskell. Let's keep the discussion about performance focused on performance. Choose a processor, then bench it with a specific algorithm.
To answer your question without specifics, I'd say that Clojure's approach to implementing STM could be a lesson in general case on how to decouple paths of execution on multicore processors and improve cache locality. But it's only effective when number of reads outweigh number of writes.
I am no parallelism expert, but here is my advice anyway.
I would expect that a locally mutable approach where each core is allocated an area of memory which is both read and written will always beat a pure approach.
Try to formulate your algorithm so that it works sequentially on a contiguous area of memory. This means that if you are working with graphs, it may be worth "flattening" nodes into arrays and replace references by indices before processing. Regardless of cache locality issues, this is always a good optimisation technique in .NET, as it helps keep garbage collection out of the way.
A great approach is to split the work into smaller sections and iterate over each section on each core.
One option I would start with is to look for cache locality improvements on a single core before going parallel, it should be simply a matter of subdividing the work again for each core. For example if you are doing matrix calculations with large matrices then you could split up the calculations into smaller sections.
Heres a great example of that: Cache Locality For Performance
There were some great sections in Tomas Petricek's book Real Work functional programming, check out Chapter 14 Writing Parallel Functional Programs, you might find Parallel processing of a binary tree of particular interest.
To write scalable Apps cache locality is paramount to your application speed. The principles are well explain by Scott Meyers talk. Immutability does not play well with cache locality since you create new objects in memory which forces the CPU to reload the data from the new object again.
As in the talk is noted even on modern CPUs the L1 cache has only 32 KB size which is shared for code and data between all cores. If you go multi threaded you should try to consume as little memory as possible (goodbye immutabilty) to stay in the fastest cache. The L2 cache is about 4-8 MB which is much bigger but still tiny compared to the data you are trying to sort.
If you manage to write an application which consumes as little memory as possible (data cache locality) you can get speedups of 20 or more. But if you manage this for 1 core it might be very well be that scaling to more cores will hurt performance since all cores are competing for the same L2 cache.
To get most out of it the C++ guys use PGA (Profile Guided Optimizations) which allows them to profile their application which is used as input data for the compiler to emit better optimized code for the specific use case.
You can get better to certain extent in a managed code but since so many factors influence your cache locality it is not likely that you will ever see a speedup of 20 in the real world due to total cache locality. This remains the regime of C++ and compilers which use profiling data.
You may get some ideas from these:
Cache-Oblivious http://supertech.csail.mit.edu/cacheObliviousBTree.html Cache-Oblivious Search Trees Project
DSapce#MIT Cache coherence strategies in a many-core processor http://dspace.mit.edu/handle/1721.1/61276
describes the revolutionary idea of cache oblivious algorithms via the elegant and efficient implementation of a matrix multiply in F#.
Or even heavily functional styles in non functional/non memory managed languages.
What sort of techniques are there to deal with problems like intermediate garbage? Cleaning up after lazynizess/thunk allocated memory. Performance(since you can't easily share resources between immutable variables if you have to track its progress to deallocate it(smart pointers?)
You might be interested in programming languages with linear or uniqueness types, these can manage resources (and memory in particular). Recent examples: ATS and LinearML.
There have been attempts at "region-based memory management" (e.g. Cyclone), but they haven't lifted off just yet -- regions also allow for (earlier) memory reclamation, but they aren't enough (e.g., there are programs which, when run with region-based memory management, will exhibit unacceptable performance). The two schemes could be mixed, I think.
Back to your question, some ATS programs can run without garbage collection. (I won't say that such programs are written in "functional" style, such as in SML, but in a mix of imperative and first-order functional style.)
The only relevant thing I can think of is how Mlton is eliminating a significant part of garbage collection with a region analysis. It should be possible, in theory, to implement a compiler which will treat an unmanageable and un-annotated pointer leak as an error, and then one would be able to use many functional programming techniques in an entirely manual memory management setting.
I was thinking on ways that functional languages could be more tied directly to their hardware and was wondering about any hardware implementations of garbage collection.
This would speed things up significantly as the hardware itself would implicitly handle all collection, rather than the runtime of some environment.
Is this what LISP Machines did? Has there been any further research into this idea? Is this too domain specific?
Thoughts? Objections? Discuss.
Because of Generational Collection, I'd have to say that tracing and copying are not huge bottlenecks to GC.
What would help, is hardware-assisted READ barriers which take away the need for 'stop the world' pauses when doing stack scans and marking the heap.
Azul Systems has done this: http://www.azulsystems.com/products/compute_appliance.htm
They gave a presentation at JavaOne on how their hardware modifications allowed for completely pauseless GC.
Another improvement would be hardware assisted write barriers for keeping track of remembered sets.
Generational GCs, and even more so for G1 or Garbage First, reduce the amount of heap they have to scan by only scanning a partition, and keeping a list of remembered sets for cross-partition pointers.
The problem is this means ANY time the mutator sets a pointer it also has to put an entry in the appropriate rememered set. So you have (small) overhead even when you're not GCing. If you can reduce this, you'd reduce both the pause times neccessary for GCing, and overall program performance.
One obvious solution was to have memory pointers which are larger than your available RAM, for example, 34bit pointers on a 32 bit machine. Or use the uppermost 8 bits of a 32bit machine when you have only 16MB of RAM (2^24). The Oberon machines at the ETH Zurich used such a scheme with a lot success until RAM became too cheap. That was around 1994, so the idea is quite old.
This gives you a couple of bits where you can store object state (like "this is a new object" and "I just touched this object"). When doing the GC, prefer objects with "this is new" and avoid "just touched".
This might actually see a renaissance because no one has 2^64 bytes of RAM (= 2^67 bits; there are about 10^80 ~ 2^240 atoms in the universe, so it might not be possible to have that much RAM ever). This means you could use a couple of bits in todays machines if the VM can tell the OS how to map the memory.
Yes. Look at the related work sections of these 2 papers:
https://research.microsoft.com/en-us/um/people/simonpj/papers/parallel-gc/index.htm
http://www.filpizlo.com/papers/pizlo-ismm2007-stopless.pdf
Or at this one:
http://researcher.watson.ibm.com/researcher/files/us-bacon/Bacon12StallFree.pdf
There was an article on lambda the ultimate describing how you need a GC-aware virtual memory manager to have a really efficient GC, and VM mapping is done mostly by hardware these days. Here you are :)
You're a grad student, sounds like a good topic for a research grant to me.
Look into FPGA design and computer architechture there are plenty of free processor design availiable on http://www.opencores.org/
Garbage collection can be implemented as a background task, it's already intended for parallel operation.
Pete
I'm pretty sure that some prototypes should exist. But develop, and produce hardware specific features is very expensive. It took very long time to implement MMU or TLB at a hardware level, which are quite easy to implement.
GC is too big to be efficiently implemented into hardware level.
Older sparc systems had tagged memory ( 33 bits) which was usable to mark addresses.
Not fitted today ?
This came from their LISP heritage IIRC.
One of my friends built a generational GC that tagged by stealing a bit from primitives. It worked better.
So, yes it can be done, but nodody bothers tagging things anymore.
runT1mes' comments about hardware assisted generational GC are worth reading.
given a sufficiently big gate array (vertex-III springs to mind) an enhanced MMU that supported GC activities would be possible.
Probably the most relevant piece of data needed here is, how much time (percentage of CPU time) is presently being spent on garbage collection? It may be a non-problem.
If we do go after this, we have to consider that the hardware is fooling with memory "behind our back". I guess this would be "another thread", in modern parlance. Some memory might be unavailable if it were being examined (maybe solvable with dual-port memory), and certainly if the HWGC is going to move stuff around, then it would have to lock out the other processes so it didn't interfere with them. And do it in a way that fits into the architecture and language(s) in use. So, yeah, domain specific.
Look what just appeared... in another post... Looking at java's GC log.
I gather the biggest problem is CPU registers and the stack. One of the things you have to do during GC is traverse all the pointers in your system, which means knowing what those pointers are. If one of those pointers is currently in a CPU register then you have to traverse that as well. Similarly if you have a pointer on the stack. So every stack frame has to have some sort of map saying what is a pointer and what isn't, and before you do any GC traversing you have to get any pointers out into memory.
You also run into problems with closures and continuations, because suddenly your stack stops being a simple LIFO structure.
The obvious way is to never hold pointers on the CPU stack or in registers. Instead you have each stack frame as an object pointing to its predecessor. But that kills performance.
Several great minds at MIT back in the 80s created the SCHEME-79 chip, which directly interpreted a dialect of Scheme, was designed with LISP DSLs, and had hardware-gc built in.
Why would it "speed things up"? What exactly should the hardware be doing?
It still has to traverse all references between objects, which means it has to run through a big chunk of data in main memory, which is the main reason why it takes time. What would you gain by this? Which particular operation is it that you think could be done faster with hardware support? Most of the work in a garbage collector is just following pointers/references, and occasionally copying live objects from one heap to another. how is this different from the instructions a CPU already supports?
With that said, why do you think we need faster garbage collection? A modern GC can already be very fast and efficient, to the point where it's basically a solved problem.