How to modify a code acceding to a function? - r

I will try to explain my problem. There are 365 (global map)files in two directories dir1 and dir2, which have the same format ,byte,extend,etc. I computed the bias between two datasets using the function and code given below as follows:
How can I solve this problem?please

I suspect this is due to memory limitations on a 32-bit system. You want to allocate an array of 933M doubles, that requires 7.6Gb of continuous memory. I suggest you to read ?Memory and ?"Memory-limits" for more details. In particular, the latter says:
Error messages beginning ‘cannot allocate vector of size’ indicate
a failure to obtain memory, either because the size exceeded the
address-space limit for a process or, more likely, because the
system was unable to provide the memory. Note that on a 32-bit
build there may well be enough free memory available, but not a
large enough contiguous block of address space into which to map
it.
If this is indeed your problem, you may look into bigmemory package (http://cran.r-project.org/web/packages/bigmemory/index.html) which allows to manage massive matrixes with shared and file-based memory. There are also other strategies (e.g. using an SQLite database) to manage data that doesn't fit in memory all at once.
Update. Here is an excerpt from Memory-limit for Windows:
The address-space limit is 2Gb under 32-bit Windows unless the OS's default has been changed to allow more (up to 3Gb). See http://www.microsoft.com/whdc/system/platform/server/PAE/PAEmem.mspx and http://msdn.microsoft.com/en-us/library/bb613473(VS.85).aspx. Under most 64-bit versions of Windows the limit for a 32-bit build of R is 4Gb: for the oldest ones it is 2Gb. The limit for a 64-bit build of R (imposed by the OS) is 8Tb.
It is not normally possible to allocate as much as 2Gb to a single vector in a 32-bit build of R even on 64-bit Windows because of preallocations by Windows in the middle of the address space.
Under Windows, R imposes limits on the total memory allocation available to a single session as the OS provides no way to do so: see memory.size and memory.limit.

Related

RISC-V 32/64-bit compatibility issues

Suppose you take an RV32 program and try running it on a 64-bit system, what compatibility issues are likely to arise?
As I understand it, the instruction encoding is the same, and on RISC-V (like other modern RISC architectures, though unlike x86), ALU operations automatically operate on whatever the word size is, so if you add the contents of a pair of registers, you will get a 32-bit or 64-bit addition as applicable. Load and store of course work on an explicitly specified size because they depend on how many bytes have been allocated in memory.
One theoretically possible compatibility issue would arise if the code depends on bits past 32 being discarded, e.g. add 2^31 to itself and compare the result with zero.
Another more practical issue would arise if the operating system supplies memory addresses outside the first 4 gigabytes, which would be garbled when the code stores the addresses in 32-bit variables.
Are there any other issues I am missing?
You are correct about both of those possible compatibility issues.
Additionally, some Control and Status Registers (namely cycleh, instreth, timeh) are not needed in RV64I and therefore don't exist. Any code which tries to access them should error.
However, there are instructions to use only the lower 32 bits for ALU operations. Which could potentially be changed by replacing the opcode and funct3 in the binary.
So with an operating system mode which returns only 32-bit addresses, it would be possible to replace a binary with a working 64 bit version so long as cycleh and friends aren't used.
References RISC-V Specification v2.2:
Chapter 4 of the RISC-V Spec. v2.2 outlines the differences from RV32I to RV64I.
Chapter 2.8 goes over Control and Status Registers
Table 19.3 lists all of the CSRs in the standard.

Force Vagrant to use Swap memory

I have one of those first alu iMacs with 2+2 GB ram. I use Vagrant to emulate advanced development environments, separated for different jobs.
When I have just one vagrant process running in the background, the computer gets to be slow as hell, because it is always out of memory.
The question is: can I use vagrant (or any app) to run only on swap memory, so it leaves all the memory for the os and other apps?
If there is any solution, how can I do that?
The short answer is: No, a process can not run in swap completely.
Processes must have their data in RAM for the CPU to be able to operate on it, infrequently used data is moved out to swap space when there's no longer space available in memory for everything that's loaded.
You could create a larger swap space and use ulimit to limit the amount of memory used by processes (i.e. force them into swap earlier), but this doesn't really address the root of your problem - that you're pretty much at the limit of your 4GB of memory.
Keep in mind that using swap space will always produce performance problems as (even with SSDs) reading from disk is far slower than reading from memory.
Short of upgrading to more memory, you could:
Reduce the amount of memory allocated by your vagrant box;
Use OS X's Activity Monitor to identify and close any programs/processes that are not in use but are still using memory.
but, again, these are just stop-gap solutions.
Simple answer is no.
Control swappiness has to be done within the VM, for example Linux, echo 100 > /proc/sys/vm/swappiness to set swap strategy to most aggressive mode. Remember, you have no control over where processes are running (physical memory VS swap)
However, by doing this, your host/guest will still be slow as hell as simply you don't have enough physical memory.
The ultimate solution is to add more RAM to your iMAC ;-D

Determine limiting factor of OpenCL workgroup size?

I am trying to run some OpenCL kernels written for desktop graphics cards on an embedded GPU with less resources. In particular, the desktop version assumes a work group size of at least 256 is always supported, but the Mali T628 ARM-based GPU only guarantees 64+ work group size.
Indeed, some kernels report CL_KERNEL_WORK_GROUP_SIZE of only 64, and I can't figure out why. I checked the CL_KERNEL_LOCAL_MEM_SIZE for the kernels in question and it is <2 KiB, whereas the CL_DEVICE_LOCAL_MEM_SIZE is 32 KiB, so I think I can rule out __local storage.
What other factors (eg, registers/__private memory?) contribute to low CL_KERNEL_WORK_GROUP_SIZE, and how do I check usage? I am open to both programmatic introspection (such as clGetKernelWorkGroupInfo() which I have already done some), and any development tools I may not know about.
EDIT:
The kernels are part of the OpenCL v2.4 module of OpenCV. In particular, the kernel icvCalcOrientation in surf.cl. The code is fairly complex, and there are several compile-time parameters set, so that's why it is a bit infeasible to manually analyze the kernel for the issue without some hint of what to look at.
If there is a way to troubleshoot this on NVidia or AMD hardware (which I have access to), I am open to it.
EDIT
Since my previous answer was plainly wrong, I need more info on the problem.
By saying "some kernels report CL_KERNEL_WORK_GROUP_SIZE of only 64" you're implying that kernels exist where a larger work-group size is available. Is that the case? If not then the answer unfortunatlely is that the device is simply not capable of supporting more than 64 work-items.
Could you please query all available infos from the device in the kernel after setting all kernel agruments and before executing the kernel. The parameters (mostly taken from (Source) ) to query are
CL_DEVICE_GLOBAL_MEM_SIZE
CL_DEVICE_LOCAL_MEM_SIZE
CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE
CL_DEVICE_MAX_MEM_ALLOC_SIZE
CL_DEVICE_MAX_WORK_GROUP_SIZE
CL_DEVICE_MAX_WORK_ITEM_SIZES
CL_KERNEL_WORK_GROUP_SIZE
CL_KERNEL_LOCAL_MEM_SIZE
CL_KERNEL_PRIVATE_MEM_SIZE
There might be more, but currently none come to mind.
General information:
A workgroup size can be limited because the local memory is limited. And this limit can be reached if you have a kernel that uses lots of private memory (“lots” is a relative term – on weaker hardware this may be reached even with seemingly few variables). "However this limit is just under ideal conditions. If your kernel uses high amount of WI per WG maybe some of the private WI data is being spilled out to local memory. [...]" (Source).
So some of this private memory may be swapped to local memory without you realizing it so the accumulated size of local memory used and the one needed for swapped private memory is bigger than the available local memory size.
CL_DEVICE_LOCAL_MEM_SIZE returns the available size of local memory, CL_KERNEL_LOCAL_MEM_SIZE tells you how much local memory you have used. Aparently this also takes dynamic local memory into consideration by looking at clSetKernelArg, however I am unsure how this is supposed to work if you query CL_KERNEL_LOCAL_MEM_SIZE before setting the kernel argument (which is what you would want to do in order to determine the size of local memory...)
Anyway, OpenCL knows exactly how much local memory you use, so it can calculate how many work-items (each of which has private memory that may need swapping to local memory) it can support. This reduced local working size may be what you get when querying CL_KERNEL_WORK_GROUP_SIZE.
After looking at the kernel you posted I don't think that local memory is the problem here (which is what you already suspected), especially since you only use 2 of the 32 KiB of local memory.
What other factors (eg, registers/__private memory?) contribute to low
CL_KERNEL_WORK_GROUP_SIZE, and how do I check usage?
On Mali all memory used by compute workloads is global (i.e. backed my system RAM), so that memory pressure shouldn't cause any problems except through secondary effects (such as cache thrashing). I suspect register allocation constraints may come into play here - larger workgroups mean more concurrent threads active in the shader core, which means higher pressure on the register file - although I don't know for sure.
The Mali offline compiler for OpenGL ES reports work register usage - for example it can report this type of information:
./malisc -c Mali-T760 -r r1p0 -d Mali-T600_r5p0-00rel0 --fragment -V test.frag
ARM Mali Offline Compiler v4.5.0
(C) Copyright 2007-2014 ARM Limited.
All rights reserved.
1 work registers used, 0 uniform registers used, spilling not used.
A L/S T Total Bound
Cycles: 2 0 0 2 A
Shortest Path: 1 0 0 1 A
Longest Path: 1 0 0 1 A
Note: The cycles counts do not include possible stalls due to cache misses.
I'm not sure if ARM have an offline compiler for OpenCL which can report similar information - it might be worth asking over on the ARM Connected Community site.

OpenCL shared memory optimisation

I am solving a 2d Laplace equation using OpenCL.
The global memory access version runs faster than the one using shared memory.
The algorithm used for shared memory is same as that in the OpenCL Game of Life code.
https://www.olcf.ornl.gov/tutorials/opencl-game-of-life/
If anyone has faced the same problem please help. If anyone wants to see the kernel I can post it.
If your global-memory really runs faster than your local-memory version (assuming both are equally optimized depending on the memory space you're using), maybe this paper could answer your question.
Here's a summary of what it says:
Usage of local memory in a kernel add another constraint to the number of concurrent workgroups that can be run on the same compute unit.
Thus, in certain cases, it may be more efficient to remove this constraint and live with the high latency of global memory accesses. More wavefronts (warps in NVidia-parlance, each workgroup is divided into wavefronts/warps) running on the same compute unit allow your GPU to hide latency better: if one is waiting for a memory access to complete, another can compute during this time.
In the end, each kernel will take more wall-time to proceed, but your GPU will be completely busy because it is running more of them concurrently.
No, it doesn't. It only says that ALL OTHER THINGS BEING EQUAL, an access from local memory is faster than an access from global memory. It seems to me that global accesses in your kernel are being coalesced which yields better performance.
Using shared memory (memory shared with CPU) isn't always going to be faster. Using a modern graphics card It would only be faster in the situation that the GPU/CPU are both performing oepratoins on the same data, and needed to share information with each-other, as memory wouldn't have to be copied from the card to the system and vice-versa.
However, if your program is running entirely on the GPU, it could very well execute faster by running in local memory (GDDR5) exclusively since the GPU's memory will not only likely be much faster than your systems, there will not be any latency caused by reading memory over the PCI-E lane.
Think of the Graphics Card's memory as a type of "l3 cache" and your system's memory a resource shared by the entire system, you only use it when multiple devices need to share information (or if your cache is full). I'm not a CUDA or OpenCL programmer, I've never even written Hello World in these applications. I've only read a few white papers, it's just common sense (or maybe my Computer Science degree is useful after all).

Efficient memory management in R

I have 6 GB memory in my machine (Windows 7 Pro 64 bit) and in R, I get
> memory.limit()
6141
Of course, when dealing with big data, memory allocation error occurs. So in order to make R to use virtual memory, I use
> memory.limit(50000)
Now, when running my script, I don't have memory allocation error any more, but R hogs all the memory in my computer so I can't use the machine until the script is finished. I wonder if there is a better way to make R manage memory of the machine. I think something it can do is to use virtual memory if it is using physical memory more than user specified. Is there any option like that?
Look at the ff and bigmemory packages. This uses functions that know about R objects to keep them on disk rather than letting the OS (which just knows about chunks of memory, but not what they represent).
R doesn't manage the memory of the machine. That is the responsibility of the operating system. The only reason memory.size and memory.limit exist on Windows is because (from help("Memory-limits")):
Under Windows, R imposes limits on the total memory allocation
available to a single session as the OS provides no way to do so:
see 'memory.size' and 'memory.limit'.
R objects also have to occupy contiguous space in RAM, so you can run into memory allocation issues with only a few large objects. You could probably be more careful with the number/size of objects you create and avoid using so much memory.
This is not a solution but a suggestion. Use memory efficient objects wherever possible: for instance, use a matrix instead of a data.frame.
Here an example
m = matrix(rnorm(1000), 2, 2)
d = as.data.frame(m)
object.size(m)
232 bytes
object.size(d)
808 bytes

Resources