Does a Tcl nested dictionary use references, and does it avoid capacity issues?

According to the thread:
TCL max size of array
Tcl cannot have more than 128M list/dictionary elements. However, one could have a nested dictionary whose total number of values (across all levels) exceeds that limit.
Now, does a nested dictionary use references by design? That would mean that as long as no single level of the dictionary tree holds more than 128M elements, you should be fine. Is that true?
Thanks.

The current limitation is that no individual memory object (C struct or array) can be larger than 2GB, and it's because the high-performance memory allocator (and a few other key APIs) uses a signed 32-bit integer for the size of the memory chunk to allocate.
This wasn't a significant limitation on a 32-bit machine, where the OS itself would usually restrict you around the time you started to near that limit.
However, on a 64-bit machine it's possible to address much more, while at the same time the size of pointers is doubled: 2GB of space means about 256M elements for a list, since each element needs at least one pointer to hold the reference to the value inside it. In addition, the reference counter system might well hit a limit in such a scheme, though that wouldn't be the problem here.
If you create a nested structure, the total number of leaf memory objects that can be held within it can be much larger, but you need to take great care to never get the string serialisation of the list or dictionary since that would probably hit the 2GB hard limit. If you're really handling very large numbers of values, you might want to consider using a database like SQLite as storage instead as that can be transparently backed by disk.
Fixing the problem is messy because it impacts a lot of the API and ABI, and creates a lot of wreckage in the process (plus a few subtle bugs if not done carefully, IIRC). We'll fix it in Tcl 9.0.
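For illustration, here is a minimal Tcl sketch (variable and key names are made up) of spreading values across two dictionary levels so that no single level grows huge, while avoiding anything that would force the whole value into its string representation:

set data [dict create]
for {set i 0} {$i < 1000000} {incr i} {
    # Bucket the keys so each inner dict stays small: 1000 buckets of ~1000 entries.
    dict set data [expr {$i / 1000}] $i [expr {$i * $i}]
}

# Reading one leaf value never builds the string rep of the whole structure.
puts [dict get $data 42 42017]

# Avoid things like [string length $data] or [puts $data]; they would
# serialise the entire nested dictionary into a single (possibly huge) string.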

Related

Are global memory barriers required if only one work item reads and writes to memory

In my kernel, each work item has a reserved memory region in a buffer
that only it writes to and reads from.
Is it necessary to use memory barriers in this case?
EDIT:
I call mem_fence(CLK_GLOBAL_MEM_FENCE) before each write and before each read. Is this enough to guarantee load/store consistency?
Also, is this even necessary if only one work item is loading from and storing to this memory region?
See this other stack overflow question:
In OpenCL, what does mem_fence() do, as opposed to barrier()?
Memory barriers work at the work-group level; that is, they stop the work items belonging to the same work group until all of them have reached the barrier. If there is no intersection between the memory regions used by different work items, no extra synchronization point is needed.
Also, is this even necessary if only one work item is loading from and storing to this memory region?
Theoretically, mem_fence only guarantees that earlier memory accesses are committed before later ones. In my case, I have never seen a difference in the results of applications whether or not they used this mem_fence call.
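For illustration, here is a minimal OpenCL C sketch (kernel name, buffer layout, and REGION_SIZE are made-up assumptions) of the pattern in the question: each work item reads and writes only its own disjoint slice of a global buffer, so no fence or barrier is needed for its own loads and stores:

// Each work item owns a disjoint REGION_SIZE slice of the buffer, so no
// other work item ever touches the memory this work item uses.
#define REGION_SIZE 16

__kernel void per_item_scratch(__global float *scratch)
{
    __global float *mine = scratch + get_global_id(0) * REGION_SIZE;

    // Ordinary loads and stores to this slice are already seen in program
    // order by this work item; mem_fence adds nothing for correctness here.
    for (int i = 0; i < REGION_SIZE; ++i)
        mine[i] = (float)i;

    float sum = 0.0f;
    for (int i = 0; i < REGION_SIZE; ++i)
        sum += mine[i];

    mine[0] = sum;
}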
Best regards

OpenCL performance: using arrays of primitives vs arrays of structures

I'm trying to use several arrays of doubles in a kernel that are all the same length. Instead of passing each double* in as a separate argument, I know I can define a structure in the .cl file that holds several doubles and then just pass into the kernel one pointer for an array of the structures instead.
Will the performance be different for the two ways? Please correct me if I am wrong, but I think passing individual double pointers means the access can be coalesced. Will accessing the structures also be coalesced?
As long as your structures don't contain any pointers, what you say is absolutely possible. The primary impact is generally, as you've already considered, the effect this has on the coalescing of memory operations. How big an effect this is depends on your memory access pattern, the size of your struct, and the device you're running on. More details would be needed to describe this more fully.
That said, one instance where I've used a struct in this way very successfully is where the element being read is the same for all work items in a work group. In this case there is no penalty on my hardware (an NVIDIA GTX 570). It is also worth remembering that in some cases the added latency introduced by memory operations being serialised can be hidden. In the CUDA world this would be achieved by having high occupancy for a problem with high arithmetic intensity.
Finally it is worth pointing out that the semantic clarity of using a struct can have a benefit in and of itself. You'll have to consider this against any performance cost for your particular problem. My advice is to try it and see; it is very difficult to predict the impact of these issues ahead of time.
Theoretically it is the same performance. However, if you access some of the members more often than others, using several segregated arrays will perform much better, due to CPU cache locality. On the other hand, most operations will be more awkward when you have several arrays.
The structures and the single elements will have the exact same performance.
Suppose you have a big array of doubles, and the first work item uses 0, 100, 200, 300, ... and the next one uses 1, 101, 201, 301, ...
If you have a structure of 100 doubles, in memory the first structure will come first (0-99), then the second (100-199), and so on. The kernels will access exactly the same memory in the same places; the only difference is how you define the memory abstraction.
In the more generic case of a structure with different element types (char, int, double, bool, ...), it may happen that the alignment is not the same as if it were a single array of data, but it will still be "semi-coalesced". I would even bet the performance is still the same.
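To make the two layouts concrete, here is a hedged OpenCL C sketch (kernel names and the two-field struct are made up) of the same computation written against separate arrays of primitives and against an array of structures; enable the fp64 extension if your device requires it:

#pragma OPENCL EXTENSION cl_khr_fp64 : enable   // doubles need fp64 support

// Variant 1: separate arrays of primitives ("structure of arrays").
// Adjacent work items read adjacent doubles, so accesses coalesce well.
__kernel void sum_soa(__global const double *a,
                      __global const double *b,
                      __global double *out)
{
    size_t gid = get_global_id(0);
    out[gid] = a[gid] + b[gid];
}

// Variant 2: one array of structures ("array of structures").
// Adjacent work items read doubles 16 bytes apart, so each load is strided
// and may need more memory transactions, depending on the device.
typedef struct { double a; double b; } pair_t;

__kernel void sum_aos(__global const pair_t *in,
                      __global double *out)
{
    size_t gid = get_global_id(0);
    out[gid] = in[gid].a + in[gid].b;
}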

High rate of Gen 1 garbage collections

I am profiling an application (using VS 2010) that is behaving badly in production. One of the recommendations given by VS 2010 is:
Relatively high rate of Gen 1 garbage collections is occurring. If, by
design, most of your program's data structures are allocated and
persisted for a long time, this is not ordinarily a problem. However,
if this behavior is unintended, your app may be pinning objects. If
you are not certain, you can gather .NET memory allocation data and
object lifetime information to understand the pattern of memory
allocation your application uses.
Searching on Google gives the following link: http://msdn.microsoft.com/en-us/library/ee815714.aspx. Are there some obvious things that I can do to reduce this issue? I seem to be lost here.
Double-click the message in the Errors List window to navigate to the
Marks View of the profiling data. Find the .NET CLR Memory# of Gen 0
Collections and .NET CLR Memory# of Gen 1 Collections columns.
Determine if there are specific phases of program execution where
garbage collection is occurring more frequently. Compare these values
to the % Time in GC column to see if the pattern of managed memory
allocations is causing excessive memory management overhead.
To understand the application’s pattern of managed memory usage,
profile it again running a .NET Memory allocation profile and request
Object Lifetime measurements.
For information about how to improve garbage collection performance,
see Garbage Collector Basics and Performance Hints on the Microsoft
Web site. For information about the overhead of automatic garbage
collection, see Large Object Heap Uncovered.
The relevant line there is:
To understand the application’s pattern of managed memory usage, profile it again running a .NET Memory allocation profile and request Object Lifetime measurements.
You need to understand how many objects are being allocated by your application and when, and how long they are alive for. You're probably allocating hundreds (or thousands!) of tiny objects inside a loop somewhere without really thinking about the consequences of reclaiming that memory when the references fall out of scope.
http://msdn.microsoft.com/en-us/library/ms973837.aspx states:
Now that we have a basic model for how things are working, let's
consider some things that could go wrong that would make it slow. That
will give us a good idea what sorts of things we should try to avoid
to get the best performance out of the collector.
Too Many Allocations
This is really the most basic thing that can go wrong. Allocating new
memory with the garbage collector is really quite fast. As you can see
in Figure 2 above, all that typically needs to happen is for the
allocation pointer to get moved to create space for your new object on
the "allocated" side—it doesn't get much faster than that. However,
sooner or later a garbage collect has to happen and, all things being
equal, it's better for that to happen later than sooner. So you want
to make sure when you're creating new objects that it's really
necessary and appropriate to do so, even though creating just one is
fast.
This may sound like obvious advice, but actually it's remarkably easy
to forget that one little line of code you write could trigger a lot
of allocations. For example, suppose you're writing a comparison
function of some kind, and suppose that your objects have a keywords
field and that you want your comparison to be case insensitive on the
keywords in the order given. Now in this case you can't just compare
the entire keywords string, because the first keyword might be very
short. It would be tempting to use String.Split to break the keyword
string into pieces and then compare each piece in order using the
normal case-insensitive compare. Sounds great right?
Well, as it turns out doing it like that isn't such a good idea. You
see, String.Split is going to create an array of strings, which means
one new string object for every keyword originally in your keywords
string plus one more object for the array. Yikes! If we're doing this
in the context of a sort, that's a lot of comparisons and your
two-line comparison function is now creating a very large number of
temporary objects. Suddenly the garbage collector is going to be
working very hard on your behalf, and even with the cleverest
collection scheme there is just a lot of trash to clean up. Better to
write a comparison function that doesn't require the allocations at
all.
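As a hedged sketch of what such an allocation-free comparison could look like (the method name, the comma delimiter, and the overall shape are assumptions, not the article's code), the keywords can be walked in place instead of being split into temporary strings:

using System;

static class KeywordComparer
{
    // Sketch: compare two comma-separated keyword strings case-insensitively,
    // keyword by keyword, without String.Split and the temporary array and
    // string objects it would allocate on every call.
    public static int CompareKeywords(string x, string y)
    {
        int ix = 0, iy = 0;
        while (ix < x.Length || iy < y.Length)
        {
            // Find the end of the current keyword in each string.
            int endX = x.IndexOf(',', ix); if (endX < 0) endX = x.Length;
            int endY = y.IndexOf(',', iy); if (endY < 0) endY = y.Length;

            int lenX = endX - ix, lenY = endY - iy;

            // Compare the two keyword slices in place, ignoring case.
            int cmp = string.Compare(x, ix, y, iy, Math.Min(lenX, lenY),
                                     StringComparison.OrdinalIgnoreCase);
            if (cmp != 0) return cmp;
            if (lenX != lenY) return lenX - lenY;

            ix = Math.Min(endX + 1, x.Length);
            iy = Math.Min(endY + 1, y.Length);
        }
        return 0;
    }
}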

Asynchronous programs showing locality of reference?

I was reading this excellent article, which gives an introduction to asynchronous programming (http://krondo.com/blog/?p=1209), and I came across the following line, which I find hard to understand:
Since there is no actual parallelism (in async), it appears from our diagrams that an asynchronous program will take just as long to execute as a synchronous one, perhaps longer as the asynchronous program might exhibit poorer locality of reference.
Could someone explain how locality of reference comes into the picture here?
Locality of reference, like that Wikipedia article mentions, is the observation that when some data is accessed (on disk, in memory, whatever), other data near that location is often accessed as well. This observation makes sense since developers tend to group similar data together. Since the data are related, they're often processed together. Specifically, this is known as spatial locality.
For a simple example, imagine computing the sum of an array or doing a matrix multiplication. The data representing the array or matrix are typically stored in contiguous memory locations, and in this example, once you access one specific location in memory, you will be accessing others close to it as well.
Computer architecture takes locality of reference into account. Operating systems have the notion of "pages" which are (roughly) 4KB chunks of data that can be paged in and out individually (moved between physical memory and disk). When you touch some memory that's not resident (not physically in RAM), the OS will bring the entire page of data off disk and into memory. The reason for this is locality: you're likely to touch other data around what you just touched.
Additionally, CPUs have the concept of caches. For example, a CPU might have an L1 (level 1) cache, which is really just a big block of on-CPU data that the CPU can access faster than RAM. If a value is in the L1 cache, the CPU will use that instead of going out to RAM. Following the principle of locality of reference, when a CPU accesses some value in main memory, it will bring that value and all values near it into the L1 cache. This set of values is known as a cache line. Cache lines vary in size, but the point is that when you access the first value of an array, the CPU might have to get it from RAM, but subsequent accesses (close in proximity) will be faster since the CPU brought the whole bundle of values into the L1 cache on the first access.
So, to answer your question: if you imagine a synchronous process computing the sum of a very large array, it will touch memory locations in order one after the other. In this case, your locality is good. In the asynchronous case, however, you might have n threads each taking a slice of the array (of size 1/n) and computing the sub-sum. Each thread is touching a potentially very different location in memory (since the array is large) and since each thread can be switched in and out of execution, the actual pattern of data access from the point of view of the OS or CPU is poor. The L1 cache on a CPU is finite, so if Thread 1 brings in a cache line (due to an access), this might evict the cache line of Thread 2. Then, when Thread 2 goes to access its array value, it has to go to RAM, which will bring in its cache line again and potentially evict the cache line of Thread 1, and so on. Depending on the system resources and usage as a whole, this pattern could happen on the OS/page level as well.
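As a hedged illustration of the access-pattern difference (this is just a C# sketch, not code from the article), compare a single sequential pass over an array with a strided pass that keeps jumping to distant cache lines:

using System;

class LocalityDemo
{
    static void Main()
    {
        var data = new double[8000000];   // roughly 64 MB of doubles

        // Good locality: consecutive elements, so each cache line and each
        // page is fully used once it has been brought in.
        double sequentialSum = 0;
        for (int i = 0; i < data.Length; i++)
            sequentialSum += data[i];

        // Poor locality: a large stride means almost every access lands on a
        // different cache line (and often a different page), so most of each
        // line brought into the cache is wasted before it is evicted.
        const int stride = 1024;          // 1024 doubles = 8 KB apart
        double stridedSum = 0;
        for (int start = 0; start < stride; start++)
            for (int i = start; i < data.Length; i += stride)
                stridedSum += data[i];

        Console.WriteLine("{0} {1}", sequentialSum, stridedSum);
    }
}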
The poorer locality of reference results in poorer cache usage -- each time you do a thread switch, you can expect that most of what's in the cache relates to that previous thread, not the current one, so most reads will get data from main memory instead of the cache.
He's ultimately wrong though, at least for quite a few programs. The reason is pretty simple: even though you gain nothing on CPU-bound code, when you can combine some CPU-bound code with some I/O bound code, you can expect an overall speed improvement. You can, for example, initiate a read or write, then switch to doing computation while the disk is busy, then switch back to the I/O bound thread when the disk finishes its work.

OutOfMemoryException

I have an application that is pretty memory hungry. It holds a large amount of data in some big arrays.
I have recently been noticing the occasional OutOfMemoryException. These OutOfMemoryExceptions occur long before my application (ASP.NET) has used up the 800 MB available to it.
I have tracked the issue down to the area of code where the array is resized. The array contains a structure that is 74 bytes in size. (I know that you shouldn't create structs bigger than 16 bytes, but this application is a port from a VB6 application.) I have tried changing the struct to a class and this appears to have fixed the problem for now.
I think the reason that changing to a class solves the problem is that, when using a struct, resizing the array requires reserving a contiguous segment of memory large enough to store the new array (i.e. (currentArraySize + increaseBySize) * 74 bytes), and such a segment cannot be found. This leads to the OutOfMemoryException.
This isn't the case with a class, as each element of the array only needs 8 bytes to store a pointer to the new object.
Is my thinking correct here?
Your assumptions regarding how arrays are stored are correct. Changing from struct to class will add a bit of overhead to each instance and you'll lose the advantages of locality, as all data must be reached via a reference, but as you have observed it may solve your memory problem for now.
When you resize an array it will create a new one to hold the new data, then copy over the data, and you will have two copies of the same data in memory at the same time. Just as you expected.
When using structs, the array will occupy the struct size * number of elements. When using a class, it will only contain the pointers.
The same scenario is also true for List<T>, which grows in size over time, so it's smart to initialize it with the expected number of items to avoid resizing and copying.
On 32-bit systems you will hit OutOfMemoryException at around ~800 MB, as you are aware. One solution you can try is to put your structs on disk and read them when needed. Since they are a fixed size, you can easily jump to the correct position in the file.
I have a project on Codeplex for handling large amounts of data. It has a type of Array with possibility for autogrowing, which might help your scenario if you run into problems with keeping it all in memory again.
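As a rough sketch of the arithmetic above (the struct below is a made-up stand-in for the question's ~74-byte record, and the padded managed size may differ slightly), the struct array forces one large contiguous block while the class array only needs room for references:

using System;
using System.Runtime.InteropServices;

// Made-up stand-in for the question's ~74-byte record.
[StructLayout(LayoutKind.Sequential, Pack = 1)]
struct Record
{
    public long A, B, C, D, E, F, G, H;   // 64 bytes
    public long I;                         // 8 bytes
    public short J;                        // 2 bytes -> 74 bytes when packed
}

class RecordObject { public Record Data; }

class ResizeDemo
{
    static void Main()
    {
        const int oldLen = 1000000;
        const int newLen = 1500000;

        // Struct array: the array block itself holds every element, so the
        // resize needs a new contiguous block of roughly newLen * 74 bytes
        // while the old block is still alive, then copies the data across.
        var structs = new Record[oldLen];
        Array.Resize(ref structs, newLen);

        // Class array: the array block only holds one reference per element,
        // so the resize needs a far smaller contiguous block; each Record
        // would live in its own small object elsewhere on the heap.
        var objects = new RecordObject[oldLen];
        Array.Resize(ref objects, newLen);

        Console.WriteLine("struct array ~ {0:N0} bytes", (long)newLen * Marshal.SizeOf(typeof(Record)));
        Console.WriteLine("class array  ~ {0:N0} bytes", (long)newLen * IntPtr.Size);
    }
}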
The issue you are experiencing might be caused by fragmentation of the Large Object Heap rather than a normal out of memory condition where all memory really is used up.
See http://msdn.microsoft.com/en-us/magazine/cc534993.aspx
The solution might be as simple as growing the array by large fixed increments rather than smaller random increments so that as arrays are freed up the blocks of LOH memory can be reused for a new large array.
This may also explain the struct->class issue as the struct is likely stored in the array itself while the class will be a small object on the small object heap.
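A minimal sketch of the "grow in large fixed increments" idea (the helper name and block size are arbitrary): round each new length up to the same block size, so the large blocks freed on the LOH can be reused by later resizes instead of fragmenting the heap.

using System;

static class ArrayGrowth
{
    // Grow an array in fixed-size steps instead of by arbitrary amounts, so
    // that freed LOH blocks come in a small number of sizes and can be
    // reused by the next resize rather than leaving unusable gaps.
    public static void EnsureCapacity<T>(ref T[] array, int needed, int blockSize = 1000000)
    {
        if (array.Length >= needed) return;
        int newLength = ((needed + blockSize - 1) / blockSize) * blockSize;
        Array.Resize(ref array, newLength);
    }
}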
The .NET Framework 4.5.1 has the ability to explicitly compact the large object heap (LOH) during garbage collection.
GCSettings.LargeObjectHeapCompactionMode = GCLargeObjectHeapCompactionMode.CompactOnce;
GC.Collect();
See more info in: GCSettings.LargeObjectHeapCompactionMode
And a question about it: Large Object Heap Compaction, when is it good?
