Troubleshooting slow writes on HP-UX - unix

Could anyone offer troubleshooting ideas, or pointers on where and how to get more information, about the difference between sys and real time in the output below?
It is my understanding that the command finished processing in the OS in about 4 seconds, but that the I/O was then queued and processed for the remaining 38.3 seconds (is that right?). At this point it is somewhat of a black box to me, and I don't know how to get additional details.
time prealloc /myfolder/testfile 2147483648
real 42.5
user 0.0
sys 4.2

You are writing 2 GB to disk on an HP-UX system; this is most likely using spinning disks (physical hard disks).
The system is writing 2 GiB / 42 s ≈ 51 MB/s, which doesn't sound slow to me.
On these systems you can use tools such as sar. Use sar -ud 5 to see CPU and disk usage during your prealloc command; you will likely see disk usage pegged at 100%.
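As a rough check of the figures above, the average write rate and the time not accounted for by CPU can be computed from the time output. This is a minimal sketch in Python; the function names are illustrative only, and the non-CPU time is just an upper bound on I/O wait, since real time also includes any period in which the process was simply not scheduled.

```python
def throughput_mb_per_s(num_bytes: int, real_seconds: float) -> float:
    """Average write rate in decimal megabytes per second."""
    return num_bytes / real_seconds / 1_000_000


def non_cpu_seconds(real: float, user: float, sys: float) -> float:
    """Wall-clock time during which the process was not executing on a CPU."""
    return real - user - sys


if __name__ == "__main__":
    size = 2147483648  # bytes passed to prealloc in the question
    print(f"{throughput_mb_per_s(size, 42.5):.1f} MB/s average")
    print(f"{non_cpu_seconds(42.5, 0.0, 4.2):.1f} s not spent on the CPU")
```

If sar shows the disk close to 100% busy for that whole period, the gap between real and sys is plausibly just the disk working through the 2 GiB.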

Related

Nuodb Memory and CPU usage reached high

While accessing a NuoDB database from a Java application, Task Manager shows CPU and memory usage reaching almost 99%. I tried NuoDB versions 2.4, 2.5 and 2.6, but I get the same issue with each.
My current Windows server hardware configuration is below.
RAM : 12 GB (3 processors ) and
Hard disk : 100 GB
Please suggest how to overcome this issue.
Thanks in advance
I see from Task Manager that MANY "NuoDB Server" processes are running
(the picture shows 7 NuoDB processes running on that server).
The problem may be that there are too many TEs, that too much memory is configured for NuoDB on
that single server, or that NuoDB is set up incorrectly.
The following link can help you understand how to check your system settings.
http://doc.nuodb.com/Latest/Default.htm#Mgr-Show-Domain.htm?Highlight=--memory

Why is direct output to network share much slower than inter-buffering?

This is an Arch Linux System where I mounted a network device over SSHFS (SFTP) using GVFS managed by Nemo FM. I'm using Handbrake to convert a video that lies on my SSD.
Observations:
If I encode the video using Handbrake and set the destination to a folder on the SSD, I get 100 FPS
If I copy a file from the SSD to the network share (without Handbrake), I get 3 MB/s
However, if I combine both (using Handbrake with the destination set to a folder on the network share), I get 15 FPS and 0.2 MB/s, both being significantly lower than the available capacities.
I suppose this is a buffering problem. But where does it reside? Is it Handbrake's fault, or perhaps GVFS caching not enough? Long story short, how can the available capacities be fully used in this situation?
When accessing the file over SFTP Handbrake will be requesting small portions of the file rather than the entire thing, meaning it is starting and finishing lots of transfers and adding that much more overhead.
Your best bet for solving this issue is to transfer the ENTIRE file to the SSD before performing the encoding. 3 MB/s is slower than direct access to an older, large-capacity mechanical drive and will not give you the performance you are looking for, so direct access to a network share is not recommended unless you can speed up those transfers significantly.
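The staging idea can be sketched in either direction: keep the encoder's reads and writes on the SSD and move the whole file across the network as one sequential transfer. The Python sketch below is only an illustration under that assumption. The function name and the 8 MiB chunk size are arbitrary choices, and whether larger chunks actually help depends on how GVFS and SFTP handle the writes.

```python
import shutil
from pathlib import Path


def copy_in_large_chunks(src: Path, dst: Path, chunk_bytes: int = 8 * 1024 * 1024) -> int:
    """Copy src to dst using large sequential reads and writes.

    Returns the size of the destination file in bytes.
    """
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        # copyfileobj reads and writes chunk_bytes at a time until EOF.
        shutil.copyfileobj(fin, fout, length=chunk_bytes)
    return dst.stat().st_size
```

In this workflow Handbrake would write to a folder on the SSD, and the finished file would then be copied to the mounted share in a single pass.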

Why is Datastax Opscenter eating too much CPU?

Environment :
machines : 2.1 xeon, 128 GB ram, 32 cpu
os : centos 7.2 15.11
cassandra version : 2.1.15
opscenter version : 5.2.5
3 keyspaces : Opscenter (3 tables), OpsCenter (10 tables), application's keyspace (485 tables)
2 datacenters: 1 for Cassandra (5 machines) and another one, DCOPS, to store the OpsCenter data (1 machine).
Right now the agents on the nodes consume on average ~1300% CPU (out of 3200% available). The only transactional load is ~1500 writes/s on the application keyspace.
Is there any relation between the number of tables and OpsCenter? Is it behaving like this, eating a lot of CPU, because the agents are trying to write data from too many metrics, or is it some kind of bug?
Note: the behaviour was the same on the previous version, OpsCenter 5.2.4. For this reason I first tried upgrading OpsCenter to the newest version available.
From opscenter 5.2.5 release notes :
"Fixed an issue with high CPU usage by agents on some cluster topologies. (OPSC-6045)"
Any help/opinion much appreciated.
Thank you.
Observing a specific agent's PID with the awesome tool you provided, Chris, I noticed that heap utilisation was constantly above 90%, which triggered a lot of GC activity with huge GC pauses of almost 1 second. During these pauses I suspect the pooling threads had to wait and blocked my CPU a lot. Anyway, I am not a specialist in this area.
I decided to enlarge the agent's heap from the default of 128 to a nicer value of 512, and I saw that all the GC pressure went away and thread allocation is now doing nicely.
Overall, CPU utilization for the OpsCenter agent dropped from 40-50% down to 1-2%. I can live with 1-2%, since I know for sure that this CPU is consumed by the jmx-metrics.
So my advice is to edit the file:
datastax-agent-env.sh
and alter the default 128 value of Xmx
-Xmx512M
save the file, restart the agent, and monitor for a while.
http://s000.tinyupload.com/?file_id=71546243887469218237
Thank you again Chris.
Hope this will help other folks.
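For anyone applying the same change to several nodes, the edit can be scripted. The following is a minimal Python sketch that rewrites the -Xmx value in the text of an environment file. The function name is made up for illustration, and the JVM_OPTS line in the example is a hypothetical stand-in, not copied from a real datastax-agent-env.sh, so check the actual file before relying on it.

```python
import re


def set_max_heap(env_text: str, new_size: str) -> str:
    """Replace every -Xmx<size> token in env_text with -Xmx<new_size>."""
    return re.sub(r"-Xmx\d+[kKmMgG]?", f"-Xmx{new_size}", env_text)


if __name__ == "__main__":
    # Hypothetical file content; the real file may be laid out differently.
    example = 'JVM_OPTS="$JVM_OPTS -Xmx128M"'
    print(set_max_heap(example, "512M"))
```

The agent still has to be restarted afterwards for the new heap size to take effect.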

Determine limiting factor of OpenCL workgroup size?

I am trying to run some OpenCL kernels written for desktop graphics cards on an embedded GPU with fewer resources. In particular, the desktop version assumes a work group size of at least 256 is always supported, but the Mali T628 ARM-based GPU only guarantees 64+ work group size.
Indeed, some kernels report CL_KERNEL_WORK_GROUP_SIZE of only 64, and I can't figure out why. I checked the CL_KERNEL_LOCAL_MEM_SIZE for the kernels in question and it is <2 KiB, whereas the CL_DEVICE_LOCAL_MEM_SIZE is 32 KiB, so I think I can rule out __local storage.
What other factors (eg, registers/__private memory?) contribute to low CL_KERNEL_WORK_GROUP_SIZE, and how do I check usage? I am open to both programmatic introspection (such as clGetKernelWorkGroupInfo(), which I have already used a little) and any development tools I may not know about.
EDIT:
The kernels are part of the OpenCL v2.4 module of OpenCV. In particular, the kernel icvCalcOrientation in surf.cl. The code is fairly complex, and there are several compile-time parameters set, so that's why it is a bit infeasible to manually analyze the kernel for the issue without some hint of what to look at.
If there is a way to troubleshoot this on NVidia or AMD hardware (which I have access to), I am open to it.
EDIT
Since my previous answer was plainly wrong, I need more info on the problem.
By saying "some kernels report CL_KERNEL_WORK_GROUP_SIZE of only 64" you're implying that kernels exist where a larger work-group size is available. Is that the case? If not, then the answer unfortunately is that the device is simply not capable of supporting more than 64 work-items.
Could you please query all available information for the device and the kernel after setting all kernel arguments and before executing the kernel? The parameters to query (mostly taken from (Source)) are:
CL_DEVICE_GLOBAL_MEM_SIZE
CL_DEVICE_LOCAL_MEM_SIZE
CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE
CL_DEVICE_MAX_MEM_ALLOC_SIZE
CL_DEVICE_MAX_WORK_GROUP_SIZE
CL_DEVICE_MAX_WORK_ITEM_SIZES
CL_KERNEL_WORK_GROUP_SIZE
CL_KERNEL_LOCAL_MEM_SIZE
CL_KERNEL_PRIVATE_MEM_SIZE
There might be more, but currently none come to mind.
General information:
A workgroup size can be limited because the local memory is limited. And this limit can be reached if you have a kernel that uses lots of private memory (“lots” is a relative term – on weaker hardware this may be reached even with seemingly few variables). "However this limit is just under ideal conditions. If your kernel uses high amount of WI per WG maybe some of the private WI data is being spilled out to local memory. [...]" (Source).
So some of this private memory may be swapped to local memory without you realizing it so the accumulated size of local memory used and the one needed for swapped private memory is bigger than the available local memory size.
CL_DEVICE_LOCAL_MEM_SIZE returns the available size of local memory, while CL_KERNEL_LOCAL_MEM_SIZE tells you how much local memory you have used. Apparently this also takes dynamic local memory into consideration by looking at clSetKernelArg; however, I am unsure how this is supposed to work if you query CL_KERNEL_LOCAL_MEM_SIZE before setting the kernel argument (which is what you would want to do in order to determine the size of local memory...)
Anyway, OpenCL knows exactly how much local memory you use, so it can calculate how many work-items (each of which has private memory that may need swapping to local memory) it can support. This reduced local working size may be what you get when querying CL_KERNEL_WORK_GROUP_SIZE.
After looking at the kernel you posted I don't think that local memory is the problem here (which is what you already suspected), especially since you only use 2 of the 32 KiB of local memory.
"What other factors (eg, registers/__private memory?) contribute to low CL_KERNEL_WORK_GROUP_SIZE, and how do I check usage?"
On Mali all memory used by compute workloads is global (i.e. backed by system RAM), so that memory pressure shouldn't cause any problems except through secondary effects (such as cache thrashing). I suspect register allocation constraints may come into play here - larger workgroups mean more concurrent threads active in the shader core, which means higher pressure on the register file - although I don't know for sure.
The Mali offline compiler for OpenGL ES reports work register usage - for example it can report this type of information:
./malisc -c Mali-T760 -r r1p0 -d Mali-T600_r5p0-00rel0 --fragment -V test.frag
ARM Mali Offline Compiler v4.5.0
(C) Copyright 2007-2014 ARM Limited.
All rights reserved.
1 work registers used, 0 uniform registers used, spilling not used.
                A       L/S     T       Total   Bound
Cycles:         2       0       0       2       A
Shortest Path:  1       0       0       1       A
Longest Path:   1       0       0       1       A
Note: The cycles counts do not include possible stalls due to cache misses.
I'm not sure if ARM have an offline compiler for OpenCL which can report similar information - it might be worth asking over on the ARM Connected Community site.
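If a compiler report like the one above is available for several kernels or build configurations, the register line can be pulled out programmatically and compared. The Python sketch below parses only the "work registers used" line as it appears in the sample output. The names are illustrative, and the wording assumed for the spilling case ("spilling used") is a guess, since the sample shows only the "not used" form.

```python
import re
from typing import NamedTuple


class RegisterReport(NamedTuple):
    work_registers: int
    uniform_registers: int
    spilling: bool


# Matches e.g. "1 work registers used, 0 uniform registers used, spilling not used."
_PATTERN = re.compile(
    r"(\d+) work registers used, (\d+) uniform registers used, spilling (not )?used"
)


def parse_register_report(compiler_output: str) -> RegisterReport:
    """Extract register usage from offline compiler output text."""
    match = _PATTERN.search(compiler_output)
    if match is None:
        raise ValueError("no register usage line found")
    # Group 3 is "not " when spilling is absent, None otherwise (assumed wording).
    return RegisterReport(
        int(match.group(1)), int(match.group(2)), match.group(3) is None
    )
```

A kernel variant whose report shows many more work registers, or spilling, would be a natural first suspect for a reduced work-group size, although that link is only the suspicion stated above and not a confirmed rule.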

Out of Memory Exception - ASP.NET - IIS 7

The problem is with Memory management because I keep receiving “Out of Memory exception”.
Here are the scenarios where we face the problem:
Please note:
1. The site/application is developed in ASP.Net and uploaded on a server with the following specs:
- Windows Server 2008 (R2) Standard
- Intel Xeon L5520 @ 2.27GHz
- RAM = 8GB
- System Type = 64bit
The application is an event-management web application whose requirements include saving huge amounts of data in Sessions etc. (mentioning this in case it is relevant)
The applications/site works fine until we:
Edit a file directly on the server
Update a file from repository
Copy/Paste a file (we don’t usually edit code using this technique)
Please note, all of the above hold true ONLY when the traffic to the site is high that is,
The issue/error “Out of Memory” is not produced when the traffic/visits is low
Details of:
System Properties > Advanced > Performance Settings > Advanced tab
Total paging file size for all drives: 16362 MB
In web.config
Is there any way we can debug this problem to the core and find a solution? Can you please provide links/help where we can further investigate this problem?
Best regards,
Farrukh
Out of Memory Exceptions are common with applications that see periodic transaction surges while keeping larger volumes of data in memory. This problem does, however, depend on your application and architecture. Below are a few pointers:
Hardware - you have Xeon 5500 (Intel Nehalem chips). These are very good at handling memory. You should be good here.
OS - Windows Server 2008 R2 - As an OS this system will handle more than enough memory for you (you are good here, see link for capabilities: Memory Limits for Windows)
Physical Memory - Did you say you have 8 GB on the server? Note that your app is allowing 16 GB. That is one issue: if your app requests more memory than is physically available, you will see your error. But this is not your only concern ...
CLR / GC limitations - Your application has a "paging file size" of 16+ GB. This is probably your issue.
GC is at the heart of your problem. In terms of why, it is the same reason Java and the JVM have issues whenever an application exceeds 2-4 GB. That requires a look at the actual process of GC.
You have "old generation" and "young generation" Garbage Collection processes. As your app runs, the CLR tries to keep your memory space organized. These processes force all threads to pause (phase changes) when GC mark and swap processes occur. The problem here is that, depending on how your code is written and the amount of memory you keep around for long periods, you can run into memory issues.
Any time you press a runtime environment to exceed the 4 GB threshold you will see exponential increases in collection times. When you hit the "stop the world" pause (the old gen GC where everything gets cleaned up) the CLR has to go through the entire heap and de-allocate memory. Based on your app, 16 GB may give you issues even with more physical memory (Windows Server 2008 R2 - Enterprise or DataCenter can support 2 TB). Even if you feed it more physical memory you may see LONG collection times when your full GC hits.
Ideally I would do the following:
Get more physical memory (to avoid out of memory errors, you never want the memory allocated to your application to come within 600 MB of your total physical memory, but your buffer does depend on your load and the application's ability to handle it ... you may want a larger safety net to be safe).
Once you have the physical memory you need, run GC logs while stressing the app. This will give you an idea of where you see exponential degradation in performance and what level your app can support when considering heap size (memory). You may want to find a way to get your 16 GB page down to a smaller size. I do know that with .NET 4.0 Microsoft has made some solid improvements to the GC process, including allowing a background thread to maintain GC. This should give you the ability to support larger heaps (in theory) ... but nothing beats real tests on the app. Check out this link for more info:
Garbage Collection Performance (Asp.net 4.0) - Also, as I am limited on links. Navigate to the Fundamentals page for some great explanations on new GC features of ASP.Net 4.0
(http://msdn.microsoft.com/en-us/library/ee787088.aspx#concurrent_garbage_collection)
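The headroom rule from the first recommendation is simple arithmetic, sketched here in Python. The 600 MB default comes from the text above; the function name and the example working-set sizes are hypothetical and would need to be replaced with measurements from the real application.

```python
def fits_in_physical_memory(app_mb: int, physical_mb: int, headroom_mb: int = 600) -> bool:
    """True if the application's memory stays at least headroom_mb below physical RAM."""
    return app_mb <= physical_mb - headroom_mb


if __name__ == "__main__":
    physical_mb = 8 * 1024  # the 8 GB server from the question
    for app_mb in (7000, 7800):  # hypothetical application working sets
        print(app_mb, fits_in_physical_memory(app_mb, physical_mb))
```

A check like this only says whether a given working set leaves the suggested buffer; it says nothing about GC pause times, which still have to be measured under load.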
Hope this helps!
PS - Anyone out there on lesser hardware will need to be aware of the ASP.NET use of the GC thread. If you are running something in development like a Core Duo you have to consider that 50% of your compute power will go to GC optimization. This means that Hardware (number of cores) is important to consider. If you have more than you need this process should theoretically help performance. If you are constrained on cores either get better hardware or use an older version of ASP.Net or consider turning the feature off (if possible). Second, if latency is a concern, using "hyper-threading" does have an impact on performance as well. You always get better performance on "physical" cores ... but that will not be a concern for 99.9% of the applications out there.
2 GB by default. If the application is large address space aware (linked with /LARGEADDRESSAWARE), it gets 4 GB (see http://msdn.microsoft.com/en-us/library/aa366778.aspx)
They're still limited to 2 GB since many applications depend on the top bit of pointers being zero.
