In Instruments, under the Allocations instrument, the "All Allocations" line shows a very small and stable amount of memory (~2.5MB) in the "Live Bytes" and "Overall Bytes" columns, but the "# Living" and "# Overall" counts keep going up gradually.
Question: Which columns matter most for finding out my app's memory footprint? What are the differences between "Live Bytes" vs "# Living" and "Overall Bytes" vs "# Overall"?
BTW: Instruments shows no memory leaks at all.
Thank you.
The Live Bytes column for the All Allocations category is the best estimate of your app's memory footprint.
The Live Bytes column tells you the amount of currently allocated (not yet freed) memory for a given category. The # Living column tells you the number of allocations that are still live, i.e., allocations minus frees. The Overall Bytes column tells you the total amount of memory allocated over the run, including memory that has since been freed, and the # Overall column tells you the total number of allocations over the run.
If you use the Leaks template, the Allocations instrument is configured to track only active memory allocations. In that mode the Live Bytes and Overall Bytes columns will match, as will the # Living and # Overall columns. Clicking the Info button next to the Allocations instrument lets you configure what it records.
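As a conceptual illustration of the four columns (this is not how Instruments is implemented, just counters that behave the same way):

    #include <stdio.h>
    #include <stdlib.h>

    /* Counters mirroring the four Allocations columns. */
    static size_t live_bytes = 0, overall_bytes = 0;
    static size_t num_living = 0, num_overall = 0;

    static void *tracked_malloc(size_t size) {
        live_bytes    += size;  /* Live Bytes: currently allocated */
        overall_bytes += size;  /* Overall Bytes: cumulative, never shrinks */
        num_living++;           /* # Living: allocations still alive */
        num_overall++;          /* # Overall: cumulative, never shrinks */
        return malloc(size);
    }

    static void tracked_free(void *p, size_t size) {
        live_bytes -= size;     /* freeing lowers the "live" numbers... */
        num_living--;           /* ...but the "overall" numbers stay put */
        free(p);
    }

    int main(void) {
        for (int i = 0; i < 1000; i++) {
            void *p = tracked_malloc(64);
            tracked_free(p, 64);    /* churn: live stays flat, overall grows */
        }
        printf("live:    %zu bytes in %zu blocks\n", live_bytes, num_living);
        printf("overall: %zu bytes in %zu blocks\n", overall_bytes, num_overall);
        return 0;
    }

A steadily rising # Living with flat Live Bytes, as described in the question, would instead suggest a growing number of small live allocations whose combined size is negligible.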
My Codename One application downloads around 16,000 records of data (approximately 10 fields in each record).
On my Android phone (OS 6.0, 2GB RAM) it is able to load 8,000 to 9,000 records, but then it shows an out-of-memory error.
From the trace, it looks like it ran out of the heap memory allocated to the app.
Any suggestions on the ideal way to handle that large an amount of data?
Here is the log file
The amount of RAM on the phone doesn't mean much. The OS takes about half and then divides the rest among the various apps running in parallel, so you would typically have much less; see "What is the maximum amount of RAM an app can use?"
You need to review your code and check what is eating up memory. 16k records of 1KB each would be 16MB, which probably shouldn't crash an app, so the question is where the memory is going. I would suggest reading the performance section of the developer guide to figure out memory usage.
This might not apply to your situation, but would it be possible to only download x records at a time? Then, when the user takes some action (scrolls, hits next page, etc.), it loads the next batch, as in the sketch below. Codename One has a great endless scroller implementation. See here for an example - https://www.codenameone.com/blog/property-cross-revisited.html
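In Codename One itself that would be Java (e.g., its InfiniteContainer), but the batching idea is framework-agnostic: keep only one window of records resident and refill it on demand. A minimal sketch with hypothetical names (fetch_batch stands in for the network call):

    #include <stdio.h>
    #include <stdlib.h>

    #define BATCH_SIZE 200                            /* hypothetical window size */

    typedef struct { char fields[10][32]; } Record;   /* ~10 fields per record */

    /* Hypothetical stub: fetch `limit` records starting at `offset` from the
       server; returns the number actually fetched. */
    static int fetch_batch(Record *buf, int offset, int limit) {
        /* ... network call would go here ... */
        (void)buf; (void)offset;
        return limit;
    }

    int main(void) {
        /* Instead of holding all 16,000 records, keep one batch resident and
           reuse the same buffer when the user scrolls to the next page. */
        Record *window = malloc(sizeof(Record) * BATCH_SIZE);
        int offset = 0;
        int got = fetch_batch(window, offset, BATCH_SIZE);
        printf("records %d..%d resident (~%zu KB)\n", offset, offset + got - 1,
               sizeof(Record) * BATCH_SIZE / 1024);
        free(window);
        return 0;
    }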
How do I calculate RCUs and WCUs given a read throughput of 32 GB/s and a write throughput of 16 GB/s?
DynamoDB Provisioned Throughput is based upon the size of the items and the number of items being read or written per second:
In DynamoDB, you specify provisioned throughput requirements in terms of capacity units. Use the following guidelines to determine your provisioned throughput:
One read capacity unit represents one strongly consistent read per second, or two eventually consistent reads per second, for items up to 4 KB in size. If you need to read an item that is larger than 4 KB, DynamoDB will need to consume additional read capacity units. The total number of read capacity units required depends on the item size, and whether you want an eventually consistent or strongly consistent read.
One write capacity unit represents one write per second for items up to 1 KB in size. If you need to write an item that is larger than 1 KB, DynamoDB will need to consume additional write capacity units. The total number of write capacity units required depends on the item size.
Therefore, when determining your desired capacity, you need to know how many items you wish to read and write per second, and the size of those items.
Rather than seeking a particular GB/s, you should be seeking a given number of items that you wish to read/write per second. That is the functionality that your application would require to meet operational performance.
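As a rough sketch of that arithmetic (the unit sizes and rounding come from the documentation quoted above; the workload numbers in main are made-up examples):

    #include <math.h>
    #include <stdio.h>

    /* One strongly consistent read/sec of up to 4 KB costs 1 RCU;
       eventually consistent reads cost half as much. */
    static double read_capacity_units(double items_per_sec, double item_kb,
                                      int eventually_consistent) {
        double units = items_per_sec * ceil(item_kb / 4.0);
        return eventually_consistent ? units / 2.0 : units;
    }

    /* One write/sec of up to 1 KB costs 1 WCU. */
    static double write_capacity_units(double items_per_sec, double item_kb) {
        return items_per_sec * ceil(item_kb);
    }

    int main(void) {
        /* Example: 500 strongly consistent reads/sec and 200 writes/sec
           of 6 KB items. */
        printf("RCU: %.0f\n", read_capacity_units(500, 6, 0)); /* 500 * 2 = 1000 */
        printf("WCU: %.0f\n", write_capacity_units(200, 6));   /* 200 * 6 = 1200 */
        return 0;
    }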
There are also some DynamoDB limits that would apply, but these can be changed upon request:
US East (N. Virginia) Region:
Per table – 40,000 read capacity units and 40,000 write capacity units
Per account – 80,000 read capacity units and 80,000 write capacity units
All Other Regions:
Per table – 10,000 read capacity units and 10,000 write capacity units
Per account – 20,000 read capacity units and 20,000 write capacity units
At 40,000 read capacity units x 4KB x 2 (eventually consistent) = 320MB/s
If my calculations are correct, your requirements are 100x this amount, so it would appear that DynamoDB is not an appropriate solution for such high throughputs.
Are your speeds correct?
Then comes the question of how you are generating so much data per second. A full-duplex 10GFC fiber runs at 2550MB/s, so you would need multiple fiber connections to transmit such data if it is going into/out of the AWS cloud.
Even 10Gb Ethernet only provides 10Gbit/s, so transferring 32GB would require 28 seconds -- and that's to transmit one second of data!
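A quick sanity check of both figures (the 320MB/s uses decimal units; the 28-second figure works out if "32GB" is read as binary gigabytes):

    #include <stdio.h>

    int main(void) {
        /* Table limit: 40,000 RCUs x 4 KB per unit x 2 for eventually consistent */
        double max_read_mb = 40000.0 * 4e3 * 2 / 1e6;
        printf("max read throughput per table: %.0f MB/s\n", max_read_mb); /* 320 */
        printf("required vs. available: %.0fx\n", 32e3 / max_read_mb);     /* 100x */

        /* 32 GB (binary) over a 10 Gbit/s link */
        double seconds = 32.0 * 1024 * 1024 * 1024 * 8 / 10e9;
        printf("transfer time: %.1f s\n", seconds);                        /* ~27.5 s */
        return 0;
    }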
Bottom line: Your data requirements are super high. Are you sure they are realistic?
If you click on the Capacity tab of your DynamoDB table, there is a capacity calculator link next to Estimated cost. You can use that to determine the read and write capacity units along with the estimated cost.
Read capacity units depend on the type of read that you need (strongly consistent or eventually consistent), the item size, and the throughput that you desire.
Write capacity units are determined by throughput and item size only.
For calculating item size, you can refer to this.
I am able to list the following parameters, which help in restricting the number of work items for a device based on the device memory:
CL_DEVICE_GLOBAL_MEM_SIZE
CL_DEVICE_LOCAL_MEM_SIZE
CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE
CL_DEVICE_MAX_MEM_ALLOC_SIZE
CL_DEVICE_MAX_WORK_GROUP_SIZE
CL_DEVICE_MAX_WORK_ITEM_SIZES
CL_KERNEL_WORK_GROUP_SIZE
I find the explanations of these parameters insufficient, and hence I am not able to use them properly.
Can somebody please tell me what these parameters mean and how they are used?
Is it necessary to check all these parameters?
PS: I have some brief understanding of some of the parameters but I am not sure whether my understanding is correct.
CL_DEVICE_GLOBAL_MEM_SIZE:
The amount of global memory on the device, in bytes. You typically don't need to care unless you use a large amount of data; the implementation will return an OUT_OF_RESOURCES error if you try to use more than is allowed.
CL_DEVICE_LOCAL_MEM_SIZE:
The amount of local memory available to each work group, in bytes. However, this limit holds only under ideal conditions: if your kernel uses a high number of work items (WI) per work group (WG), some of the private WI data may be spilled out to local memory. So take it as the maximum amount available per WG.
CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE:
The maximum amount of constant memory, in bytes, that can be used by a single kernel. If your constant buffers together exceed this amount, the call will either fail or silently use normal global memory instead (and may therefore be slower).
CL_DEVICE_MAX_MEM_ALLOC_SIZE:
The maximum size, in bytes, of a single allocation (one buffer) on the device.
CL_DEVICE_MAX_WORK_GROUP_SIZE:
The maximum work group size of the device. This is the ideal maximum; depending on the kernel code, the real limit may be lower (see CL_KERNEL_WORK_GROUP_SIZE).
CL_DEVICE_MAX_WORK_ITEM_SIZES:
The maximum number of work items per dimension. For example, the device may report a maximum size of 1024 WI and 3 dimensions, but you may not be able to use (1024,1,1) as the size, since the per-dimension limits may be (64,64,64); in that case you could do (64,2,8), for example.
CL_KERNEL_WORK_GROUP_SIZE:
The maximum work group size with which this particular kernel can be run on this device, queried per kernel with clGetKernelWorkGroupInfo. Because it accounts for the kernel's actual resource usage (registers, memory spills, etc.), it may be lower than CL_DEVICE_MAX_WORK_GROUP_SIZE.
NOTE: All of these are theoretical limits. If your kernel uses one resource more heavily than another (e.g., local memory whose usage depends on the work group size), you may not be able to reach the maximum work items per work group, since you may hit the local memory limit first.
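A minimal sketch of how the device-level values can be queried (assuming a single platform and device, 3 work-item dimensions, and omitting error handling); CL_KERNEL_WORK_GROUP_SIZE is queried separately, per built kernel, with clGetKernelWorkGroupInfo:

    #include <stdio.h>
    #include <CL/cl.h>

    int main(void) {
        cl_platform_id platform;
        cl_device_id device;
        clGetPlatformIDs(1, &platform, NULL);
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_DEFAULT, 1, &device, NULL);

        cl_ulong global_mem, local_mem, const_buf, max_alloc;
        size_t max_wg, max_wi[3];
        clGetDeviceInfo(device, CL_DEVICE_GLOBAL_MEM_SIZE, sizeof global_mem, &global_mem, NULL);
        clGetDeviceInfo(device, CL_DEVICE_LOCAL_MEM_SIZE, sizeof local_mem, &local_mem, NULL);
        clGetDeviceInfo(device, CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE, sizeof const_buf, &const_buf, NULL);
        clGetDeviceInfo(device, CL_DEVICE_MAX_MEM_ALLOC_SIZE, sizeof max_alloc, &max_alloc, NULL);
        clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_GROUP_SIZE, sizeof max_wg, &max_wg, NULL);
        clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_ITEM_SIZES, sizeof max_wi, max_wi, NULL);

        printf("global mem:      %llu bytes\n", (unsigned long long)global_mem);
        printf("local mem:       %llu bytes per WG\n", (unsigned long long)local_mem);
        printf("constant buffer: %llu bytes\n", (unsigned long long)const_buf);
        printf("max alloc:       %llu bytes\n", (unsigned long long)max_alloc);
        printf("max WG size:     %zu\n", max_wg);
        printf("max WI sizes:    (%zu, %zu, %zu)\n", max_wi[0], max_wi[1], max_wi[2]);
        return 0;
    }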
Each GPU device (AMD, Nvidia, or any other) is split into several Compute Units (MultiProcessors), each of which has a fixed number of cores (VertexShaders/StreamProcessors). So, one has (Compute Units) x (VertexShaders/Compute Unit) simultaneous processors to compute with, but there is only a small fixed amount of __local memory (usually 16KB or 32KB) available per MultiProcessor. Hence, the exact number of these multiprocessors matters.
Now my questions:
(a) How can I know the number of multiprocessors on a device? Is this the same as CL_DEVICE_MAX_COMPUTE_UNITS? Can I deduce it from specification sheets such as http://en.wikipedia.org/wiki/Comparison_of_AMD_graphics_processing_units?
(b) How can I know how much __local memory per MP there is available on a GPU before buying it? Of course I can request CL_DEVICE_LOCAL_MEM_SIZE on a computer that runs it, but I don't see how I can deduce it from even an individual detailed specifications sheet such as http://www.amd.com/us/products/desktop/graphics/7000/7970/Pages/radeon-7970.aspx#3?
(c) What is the card with currently the largest CL_DEVICE_LOCAL_MEM_SIZE? Price doesn't really matter, but 64KB (or larger) would give a clear benefit for the application I'm writing, since my algorithm is completely parallelizable, but also highly memory-intensive with random access pattern within each MP (iterating over edges of graphs).
CL_DEVICE_MAX_COMPUTE_UNITS should give you the number of Compute Units; otherwise you can glean it from the appropriate manuals (the AMD OpenCL programming guide and the Nvidia OpenCL programming guide).
The linked guide for AMD contains information about the available local memory per Compute Unit (generally 32KB/CU). For NVIDIA, a quick Google search revealed this document, which gives the local memory size as 16KB/CU for G80- and G200-based GPUs. For Fermi-based cards (GF100) there is 64KB of on-chip memory available, which can be configured as either 48KB local memory and 16KB L1 cache, or 16KB local memory and 48KB L1 cache. Furthermore, Fermi-based cards have an L2 cache of up to 768KB (768KB for GF100 and GF110, 512KB for GF104 and GF114, and 384KB for GF106 and GF116; none for GF108 and GF118, according to Wikipedia).
From the information above it would seem that current Nvidia cards have the most local memory per Compute Unit, and to my understanding they are the only ones with a general L2 cache.
For your usage of local memory you should however remember that local memory is allocated per work group (and only accessible to that work group), while a Compute Unit can typically sustain more than one work group. So if your algorithm allocates the whole local memory to one work group, you will not be able to achieve the maximum amount of parallelism. Also note that since local memory is banked, random access will lead to a lot of bank conflicts and warp serializations, so your algorithm might not parallelize quite as well as you think it will (or maybe it will; just mentioning the possibility). See the padding sketch below for one common mitigation.
With a Fermi-based card your best bet might be to count on the caches instead of explicit local memory, if all your work groups operate on the same data (I don't know how to switch the L1/local memory configuration, though).
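On the bank-conflict point: a common mitigation when work items read local memory column-wise is to pad the local array so that consecutive rows start in different banks. A hypothetical kernel sketch (the 32x32 tile size is an assumption; the classic example is a matrix transpose):

    // OpenCL C kernel: padded local tile to avoid bank conflicts.
    #define TILE 32

    __kernel void transpose(__global const float *in, __global float *out,
                            int width, int height) {
        // The +1 padding makes column accesses fall into different banks.
        __local float tile[TILE][TILE + 1];

        int gx = get_global_id(0), gy = get_global_id(1);
        int lx = get_local_id(0),  ly = get_local_id(1);

        if (gx < width && gy < height)
            tile[ly][lx] = in[gy * width + gx];
        barrier(CLK_LOCAL_MEM_FENCE);

        int ox = get_group_id(1) * TILE + lx;
        int oy = get_group_id(0) * TILE + ly;
        if (ox < height && oy < width)
            out[oy * height + ox] = tile[lx][ly];  // column-wise read, conflict-free
    }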
I have an ASP.NET / C# web application that is using a lot of memory.
ANTS Memory Profiler and PerfMon both show my Gen 0 heap growing rapidly to about 1 GB in size during Application_Start. I read here that the PerfMon counter for Gen 0 Heap is actually showing the "budget" for Gen 0, not the size (which I take to mean not all that memory is part of the private working set of the process?). However the ANTS profiler does show about 700 MB of "unused memory allocated to .NET", and this does seem to be part of the private working set of the process (as reported in taskmgr). I am guessing this large amount of unused memory is related to the large Gen 0 heap.
What is happening in Application_Start while this happens is that I'm in a while loop reading about a million rows from a SqlDataReader. These are being used to populate a large cache for later use. Given this, the obvious culprit for the large amount of unused memory was large object heap fragmentation, but I don't think this is the case as I'm pre-allocating more than is needed for my large cache object. To be certain, I even tried commenting out the part of the loop that actually adds to my cache object; it made no difference in the amount of unused memory allocated.
As a test, I tried frequently forcing garbage collection of gen 0 during the loop (against all recommendations, I know), and this caused the size of the gen0 heap to stay down around 128 MB and also resulted in only a few MB of unused free memory. But it also maxed out my CPU and made Application_Start take way too long.
My questions are:
1) What can cause the reported size of the Gen 0 Heap to grow so large?
2) Is this a problem? In particular, could it be causing a large amount of unused space to be allocated to .NET?
3) If so, what should I do to fix it? If I can't prevent the process from using that much memory during Application_Start, I'd like to at least be able to make it give up the memory when app start completes.
Gen 0 contains "the youngest, most recently allocated objects", and is separate and distinct from the LOH. What it sounds like is that you are allocating tons and tons of small objects (all of the transient data associated with those cache entries, by the sound of it) which are unrooted (demonstrated by the frequent GCs keeping the size down) but not cleaned up in a timely manner, simply because the GC has not yet seen a need to clean them up. How much RAM does the machine have? I am guessing that you are not paging.
My understanding is that .NET can return chunks of unused heap to the OS when the GC deems it is no longer needed (because that heap space hasn't been used for a long time, for example). Have you observed the app over a long period of time, to see whether this happens? If you are not paging, I don't think it is a problem.