A detail about SGX loading

A detail about SGX loading - intel

Is it possible to load a program larger than the EPC memory to an enclave? I feel like in theory it is permissible because
OS can swap pages out freely
EEXTEND measures an enclave incrementally by 256 bytes
So in theory, it seems possible to load a big program using just one page of EPC memory:
load 4K bytes to an EPC page
measure the loaded page
evict the loaded page
load the next 4K bytes to the same EPC page as the one in (1)
Am I understanding correctly in theory? Although in practice, I got an error immediately when loading big programs.

I asked a similar question in the Intel forums. The summary [1] is helpful.
The short answer: No, you cannot at this time load an enclave that is larger than the EPC.
Due to the current lack of paging support (and lack of dynamic page allocation that v2 will provide) this means that the combined HeapMaxSize of all enclaves loaded at the same time cannot exceed said ~90MB. [1]
The long answer: In SGX there are two mechanisms of dynamic memory management:
an enclave can request additional pages via EAUG - this is only supported in SGXv2, for which no hardware is currently available
the OS could swap out EPC pages to regular RAM (EWB/ELD instructions), but Windows does not currently support this
So why can you not load an enclave larger than EPC?
the EPC size is limited on current systems to roughly 90MB
Windows does not currently support swapping out these pages
an enclave must request all pages it wishes to use before executing (EINIT) on SGXv1 hardware
the size of all enclaves must not exceed the EPC size
Intel reserves some EPC space for their management enclaves (quoting, provisioning, loading enclaves)
So your enclave will have to use well below 90MB of heap size on current hardware. I have experimented with the SDK emulation, and found that it allows a heap max size of roughly 1GiB [2]. Future OS versions will hopefully support EPC page swapping, allowing larger static enclave sizes. Future SGX hardware will allow dynamic page allocation, allowing dynamic enclave sizes.
[1] https://software.intel.com/en-us/forums/intel-isa-extensions/topic/607004#comment-1857071
[2] 1GiB - 64KiB - TCSnum * 128KiB, where TCSnum is the number of threads. Exceeding this HeapMaxSize results in a simulation error

Researcher here, working with Intel SGX.
I would just like to add that Linux, however, does support mechanism 2) mentioned above, allowing pages to be encrypted and swapped out to regular DRAM.
What this effectively means is yes to your original question. Linux is able to create enclaves of arbitrary size. However, in its current form(v1) once the enclave is finalized the size may not expand.
As to whether this is a good idea, the answer is definitely no. Expanding enclaves above the size of the EPC causes a lot of costly pagefaults to occur degrading performance significantly.

Related

SSE 4 memory load optimizations

When using SSE instructions/intrinsics, say for 256-bit registers, has anyone been able to reduce time spent loading the extended registers from memory by using either the prefetch instruction on the next 32-byte chunk, or by some other technique? Assume the data to be loaded is already properly aligned in memory.

See the x86 tag wiki for more info about x86 CPU performance. Hardware prefetchers are pretty good at locking onto patterns of sequential access, so you don't usually need software prefetch instructions.
Usually it's not a win to do a wide vector load an unpack it into separate integer registers. Once you've touched a cache line, more loads from it are cheap, and throughput from L1 cache into registers isn't usually the problem. Using ALU instruction to unpack a 256b load into separate 32 or 64b integers just takes more instructions and means you're more likely to bottleneck on ALU throughput.

Determine limiting factor of OpenCL workgroup size?

I am trying to run some OpenCL kernels written for desktop graphics cards on an embedded GPU with less resources. In particular, the desktop version assumes a work group size of at least 256 is always supported, but the Mali T628 ARM-based GPU only guarantees 64+ work group size.
Indeed, some kernels report CL_KERNEL_WORK_GROUP_SIZE of only 64, and I can't figure out why. I checked the CL_KERNEL_LOCAL_MEM_SIZE for the kernels in question and it is <2 KiB, whereas the CL_DEVICE_LOCAL_MEM_SIZE is 32 KiB, so I think I can rule out __local storage.
What other factors (eg, registers/__private memory?) contribute to low CL_KERNEL_WORK_GROUP_SIZE, and how do I check usage? I am open to both programmatic introspection (such as clGetKernelWorkGroupInfo() which I have already done some), and any development tools I may not know about.
EDIT:
The kernels are part of the OpenCL v2.4 module of OpenCV. In particular, the kernel icvCalcOrientation in surf.cl. The code is fairly complex, and there are several compile-time parameters set, so that's why it is a bit infeasible to manually analyze the kernel for the issue without some hint of what to look at.
If there is a way to troubleshoot this on NVidia or AMD hardware (which I have access to), I am open to it.

EDIT
Since my previous answer was plainly wrong, I need more info on the problem.
By saying "some kernels report CL_KERNEL_WORK_GROUP_SIZE of only 64" you're implying that kernels exist where a larger work-group size is available. Is that the case? If not then the answer unfortunatlely is that the device is simply not capable of supporting more than 64 work-items.
Could you please query all available infos from the device in the kernel after setting all kernel agruments and before executing the kernel. The parameters (mostly taken from (Source) ) to query are
CL_DEVICE_GLOBAL_MEM_SIZE
CL_DEVICE_LOCAL_MEM_SIZE
CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE
CL_DEVICE_MAX_MEM_ALLOC_SIZE
CL_DEVICE_MAX_WORK_GROUP_SIZE
CL_DEVICE_MAX_WORK_ITEM_SIZES
CL_KERNEL_WORK_GROUP_SIZE
CL_KERNEL_LOCAL_MEM_SIZE
CL_KERNEL_PRIVATE_MEM_SIZE
There might be more, but currently none come to mind.
General information:
A workgroup size can be limited because the local memory is limited. And this limit can be reached if you have a kernel that uses lots of private memory (“lots” is a relative term – on weaker hardware this may be reached even with seemingly few variables). "However this limit is just under ideal conditions. If your kernel uses high amount of WI per WG maybe some of the private WI data is being spilled out to local memory. [...]" (Source).
So some of this private memory may be swapped to local memory without you realizing it so the accumulated size of local memory used and the one needed for swapped private memory is bigger than the available local memory size.
CL_DEVICE_LOCAL_MEM_SIZE returns the available size of local memory, CL_KERNEL_LOCAL_MEM_SIZE tells you how much local memory you have used. Aparently this also takes dynamic local memory into consideration by looking at clSetKernelArg, however I am unsure how this is supposed to work if you query CL_KERNEL_LOCAL_MEM_SIZE before setting the kernel argument (which is what you would want to do in order to determine the size of local memory...)
Anyway, OpenCL knows exactly how much local memory you use, so it can calculate how many work-items (each of which has private memory that may need swapping to local memory) it can support. This reduced local working size may be what you get when querying CL_KERNEL_WORK_GROUP_SIZE.
After looking at the kernel you posted I don't think that local memory is the problem here (which is what you already suspected), especially since you only use 2 of the 32 KiB of local memory.

What other factors (eg, registers/__private memory?) contribute to low
CL_KERNEL_WORK_GROUP_SIZE, and how do I check usage?
On Mali all memory used by compute workloads is global (i.e. backed my system RAM), so that memory pressure shouldn't cause any problems except through secondary effects (such as cache thrashing). I suspect register allocation constraints may come into play here - larger workgroups mean more concurrent threads active in the shader core, which means higher pressure on the register file - although I don't know for sure.
The Mali offline compiler for OpenGL ES reports work register usage - for example it can report this type of information:
./malisc -c Mali-T760 -r r1p0 -d Mali-T600_r5p0-00rel0 --fragment -V test.frag
ARM Mali Offline Compiler v4.5.0
(C) Copyright 2007-2014 ARM Limited.
All rights reserved.
1 work registers used, 0 uniform registers used, spilling not used.
A L/S T Total Bound
Cycles: 2 0 0 2 A
Shortest Path: 1 0 0 1 A
Longest Path: 1 0 0 1 A
Note: The cycles counts do not include possible stalls due to cache misses.
I'm not sure if ARM have an offline compiler for OpenCL which can report similar information - it might be worth asking over on the ARM Connected Community site.

Out of Memory Exception - ASP.NET - IIS 7

The problem is with Memory management because I keep receiving “Out of Memory exception”.
Here are the scenarios where we face the problem:
Please note:
1. The site/application is developed in ASP.Net and uploaded on a server with the following specs:
- Windows Server 2008 (R2) Standard
- Intel Xeon L5520#2.27GHz 2.27GHz
- RAM = 8GB
- System Type = 64bit
The application is event management based web application where the requirements include saving huge amount of data in Sessions etc (mentioning this in case it is relevant)
The applications/site works fine until we:
Edit a file directly on the server
Update a file from repository
Copy/Paste a file (we don’t usually edit code using this technique)
Please note, all of the above hold true ONLY when the traffic to the site is high that is,
The issue/error “Out of Memory” is not produced when the traffic/visits is low
Details of:
System Properties > Advanced > Performance Settings > Advanced tab
Total paging file size for all drives: 16362 MB
In web.config
Is there any way we can debug this problem to the core and find out a solution. Can you please provide links/help where we can further investigate this problem?
Best regards,
Farrukh

Out of Memory Exceptions are common with applications that see periodic transaction surges while keeping larger volumes of data in memory. This problem does, however, depend on your application and architecture. Below are a few pointers:
Hardware - you have Xeon 5500 (Intel Nehalem chips). These are very good at handling memory. You should be good here.
OS - Windows Server 2008 R2 - As an OS this system will handle more than enough memory for you (you are good here, see link for capabilities: Memory Limits for Windows)
Physical Memory - Did you say you have 8 GB on the server? Note you app is allowing 16 GB. There is one issue. If your app requests more memory than physically available you will see your error. But this is not your only concern ...
CLR / GC limitations - Your application has a "paging file size" of 16+ GB. This is probably your issue.
GC is the heart of your problem for you. In terms of why, it is the same reason Java and the JVM have issues whenever an application exceeds 2-4 GB. That requires a look at the actual process of GC.
You have "old generation" and "young generation" Garbage Collection processes. As you app runs the CLR tries to keep your memory space organized. These processes force all threads to pause (phase changes) when GC mark and swap processes occur. The problem here is, depending on how your code is written and the amount of memory you keep around for long periods, you can run into memory issues.
Any time you press a runtime environment to exceed the 4 GB threshold you will see exponential increases in collection times. When you hit the "stop the world" pause (the old gen GC where everything gets cleaned up) the CLR has to go through the entire heap and de-allocate memory. Based on your app, 16 GB may give you issues even with more physical memory (Windows Server 2008 R2 - Enterprise or DataCenter can support 2 TB). Even if you feed it more physical memory you may see LONG collection times when your full GC hits.
Ideally I would do the following:
Get more physical memory (you never want to come withing 600MB of your total physical memory allocated to your application to avoid out of memory errors, but your buffer does depend on your load and the application's ability to handle it ... you may want a larger safety net to be safe).
Once you have the physical memory you need run GC logs while stressing the app. This will give you an idea where you see exponential degradation in performance and what level your app can support when considering Heap size (Memory). You may want to find a way to get your 16GB page down to a smaller size. I do know with .Net 4.0 Microsoft has made some solid improvements to the GC process, including allowing a background thread to maintain GC. This should give you the ability to support larger heaps (in theory) ... but nothing beats real tests on the app. Check out this link for more info:
Garbage Collection Performance (Asp.net 4.0) - Also, as I am limited on links. Navigate to the Fundamentals page for some great explanations on new GC features of ASP.Net 4.0
(http://msdn.microsoft.com/en-us/library/ee787088.aspx#concurrent_garbage_collection)
Hope this helps!
PS - Anyone out there on lesser hardware will need to be aware of the ASP.NET use of the GC thread. If you are running something in development like a Core Duo you have to consider that 50% of your compute power will go to GC optimization. This means that Hardware (number of cores) is important to consider. If you have more than you need this process should theoretically help performance. If you are constrained on cores either get better hardware or use an older version of ASP.Net or consider turning the feature off (if possible). Second, if latency is a concern, using "hyper-threading" does have an impact on performance as well. You always get better performance on "physical" cores ... but that will not be a concern for 99.9% of the applications out there.

2 GB by default. If the application is large address space aware (linked with /LARGEADDRESSAWARE), it gets 4 GB (see http://msdn.microsoft.com/en-us/library/aa366778.aspx)
They're still limited to 2 GB since many application depends on the top bit of pointers to be zero.

Flash Total Memory Usage and TaskManager Memory Usage are diffrent?

Hi I wrote an application in flash AS3, and when I trace from flash the total memory usage of the total application is only about 9MB, But at the same time Task Manager Shows the memory usage as 110MB. Around 100MB difference.
Flash Trace Method System.totalMemory difference of the Trace from the Beginning of the application to end of the application.

The amount of memory used by the flash player isn't necessarily related to how much memory your application is using. The players memory usage depends on how much memory the os gives it and a number of other things, if you have plenty of free memory there's no reason not to have the flash player sit on some for when it's needed.
All in all, you only need to worry about the actual memory usage reported by System.totalMemory*
* But do note that it reports the memory used for all currently running flash apps

Internally Flash Player will garbage collect making System.totalMemory accurate for internal usage. But even when the memory is GC'd it isn't given right back to the system. In IE you can cause the browser to give the GC'd space back by minimizing the browser. So essentially the value you see in the Task Manager is a high water mark of memory usage. If you need that value to be lower then the only thing you can do is use less memory. For example, before loading / creating something new, wait until something else has been GC'd so that Flash Player doesn't allocate new memory for an instant. The challenge is in knowing when something has been actually GC'd. There isn't a good way to do that.

What is the highest number of threads that is reasonable to simultaneously run in Jmeter?

I want to use the highest possible number of threads (to use less computers) but without making the bottleneck to be in the client.

JMeter can simulate a very High Load provided you use it right.
Don't listen to Urban Legends that say JMeter cannot handle high load.
Now as for answer, it depends on:
your machine power
your jvm 32 bits or 64 bits
your jvm allocated memory -Xmx
your test plan ( lot of beanshell, post processor, xpath ... Means lots of cpu)
your os configuration (tunable)
Gui / non gui mode
So there is no theorical answer but following Best Practices will ensure JMeter performs well.
Note that with jmeter you can distribute load through remote testing, read:
Remote Testing > 15.4 Using a different sample sender
And finally use cloud based testing if it's not enough.
Read this for tuning tips:
http://www.ubik-ingenierie.com/blog/jmeter_performance_tuning_tips/
Read this book for doing load testing and using JMeter correctly.

I have used JMeter a fair bit and found it is not great at generating really high load. On a 2Ghz Core2 Duo with 2Gb memory you can reasonably expect about 100 threads.
That being said, it is best to run it on your hardware so that the CPU of the PC does not peak at 100% - a stable 80%-90% is best otherwise the results are affected.
I have also tried WAPT 5 - it successfully ran 1000+ threads from the same PC. It is not free but it is more useable than JMeter but doesn't have all of the features.
Outdated answer since at least version 2.6 see https://stackoverflow.com/a/11922239/460802 for a more up to date one.

The JMeter Wiki reports cases where JMeter was used with as much as 1000 threads. I have used it with at most 100 threads, but the Links in the Wiki suggest resource reductions I never tried.

One of the issues we had with running JMeter on Windows XP was the Windows XP TCP Connection Limit. Limit should be removed in order to run use the JMeter to workstation’s full potential
More info here. AFAIK, does not apply to other OS.

I used JMeter since 2004 and i launched lot of load tests.
With PC Windows 7 64 bits 4Go RAM iCore5.
I think JMeter can support 300 to 400 concurrent threads for Http (Sampler) protocol with only one "Aggregate Report Listener" who writes in the log file results and timers between call pages.
For a big load test you could configure JMeter with slaves (load generators) like this
http://jmeter-plugins.org/wiki/HttpSimpleTableServer/
I have already done tests with 11 PC slaves to simulate 5000 threads.

I have not used JMeter, but the answer probably depends on your hardware. Best bet might be to establish metrics of performance, guess at the number of threads and then run a binary search as follows.
Source was Wikipedia.
Number guessing game...
This rather simple game begins something like "I'm thinking of an integer between forty and sixty inclusive, and to your guesses I'll respond 'High', 'Low', or 'Yes!' as might be the case." Supposing that N is the number of possible values (here, twenty-one as "inclusive" was stated), then at most questions are required to determine the number, since each question halves the search space. Note that one less question (iteration) is required than for the general algorithm, since the number is already constrained to be within a particular range.
Even if the number we're guessing can be arbitrarily large, in which case there is no upper bound N, we can still find the number in at most steps (where k is the (unknown) selected number) by first finding an upper bound by repeated doubling. For example, if the number were 11, we could use the following sequence of guesses to find it: 1, 2, 4, 8, 16, 12, 10, 11
One could also extend the technique to include negative numbers; for example the following guesses could be used to find −13: 0, −1, −2, −4, −8, −16, −12, −14, −13

It is more dependent on the kind of performance testing you do(load, spike, endurance etc) on a specific server (a little on hardware dependency)
Keep in mind around these parameters
- the client machine on which you are targeting the run of jmeter, there will be a certain amount of heap memory allocated, ensure to have a healthy allocation so that the script does not error out. The highest i had run on jmeter was 1500 on a local environment ( client - server arch), On a Web arch, the highest i had a run was based upon Non- functional requirement were limited to 250 threads,
so it ideally depends on the kinds of performance testing and deployment style and so on..

There is not standard number for this. The maximum number of threads that you can generate from one computer depends completely on the computer's hardware and the OS. The OS by default occupies certain amount of CPU and the RAM.
To find out the maximum threads your computer can handle you can prepare a sample test and run it with only a few threads. Then with each cycle of test run increase the number of threads gradually. During this you also need to monitor the CPU, RAM, Disk I/O and Network I/O of your computer. The moment any of these reach near or beyond 80% (Again for you to decide if near is okay for you or beyond), that is the maximum number of threads your computer can handle. To be on the safer side I would stop at the number when the resource utilization reaches 70%.

It'll depend on the hardware you run on as well as the underlying script. I've always felt that this fuzziness is the biggest problem with traditional load testing tools. If you've got a small budget ($200 or so gets you a LOT of testing), check out my company's load testing service, BrowserMob.
Besides our Real Browser Users (RBUs) which control thousands on actual browsers for the purpose of performance and load testing, we also have traditional virtual users (VUs). Scripts are written in JavaScript and can make various HTTP calls.
The reason I bring it up is that I always felt that the game of trying to figure out how many VUs you can fit on your load gen hardware is dangerous. It's so easy to get bad results without realizing it.
To solve that for BrowserMob, we took an extremely conservative approach on the number of VUs and RBUs per CPU core: no more than 1 browser or 50 threads per CPU core, and sometimes much less. In the world of cloud computing, CPU cycles are so cheap that it just doesn't make sense to try to overload machines.

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex