How do malloc() and sbrk() work in Unix?

I am new to UNIX, and I am studying some of the UNIX system calls, such as brk(), sbrk(), and so on.
Yesterday I read about the malloc() function, and I was a little confused!
Can anybody tell me why malloc reduces the number of sbrk() system calls that the program must perform?
And another question: do brk(0), sbrk(0) and malloc(0) return the same value?

Syscalls are expensive because of the overhead they incur: you have to switch to kernel mode. A system call enters the kernel by issuing a "trap" or interrupt; it is a request to the kernel for a service, and because it executes in the kernel address space, it carries the high cost of switching into the kernel (and then switching back).
This is why malloc reduces the number of calls to sbrk() and brk(): it requests more memory from the kernel than you asked for, so that it doesn't have to issue a syscall every time you need more memory.
brk() and sbrk() are different.
brk is used to set the end of the data segment to the value you specify. It says "set the end of my data segment to this address". Of course, the address you specify must be reasonable, the operating system must have enough memory, and you can't make it point to somewhere that would otherwise exceed the process maximum data size. Thus, brk(0) is invalid, since you'd be trying to set the end of the data segment to address 0, which is nonsense.
On the other hand, sbrk increments the data segment size by the amount you specify, and returns a pointer to the previous break value. Calling sbrk with 0 is valid; it is a way to get a pointer to the current data segment break address.
malloc is not a system call, it's a C library function that manages memory using sbrk. According to the manpage, malloc(0) is valid, but not of much use:
If size is 0, then malloc() returns either NULL, or a unique pointer value that can later be successfully passed to free().
So, no, brk(0), sbrk(0) and malloc(0) are not equivalent: the first is invalid, the second is used to obtain the address of the program's break, and the last is essentially useless.
Keep in mind that you should never mix malloc with brk or sbrk in the same program. malloc assumes it has full control of brk and sbrk; if you interleave calls to malloc and brk, very weird things can happen.
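
To make the differences concrete, here is a minimal C sketch (for illustration only: it touches both malloc and brk/sbrk, which, as noted above, you should not do in a real program; brk/sbrk may require _DEFAULT_SOURCE on glibc, and exact addresses vary by system):

#define _DEFAULT_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void) {
    void *cur = sbrk(0);            /* valid: returns the current program break */
    printf("current break: %p\n", cur);

    void *p = malloc(0);            /* valid but useless: NULL or a unique pointer */
    printf("malloc(0):     %p\n", p);
    free(p);

    if (brk(0) == -1)               /* invalid: cannot set the break to address 0 */
        perror("brk(0)");
    return 0;
}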

why malloc reduces the number of sbrk() system calls that the program must perform?
Say you call malloc() to request 10 bytes of memory. The implementation may use sbrk (or another system call such as mmap) to request 4 KiB from the OS. Then, when you call malloc() the next time to request another 10 bytes, it doesn't have to issue a system call; it can simply return some of the memory obtained from the previous 4 KiB request.
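
For instance, this hedged C sketch makes the batching visible by reading the break with sbrk(0), which only inspects it, before and after two small allocations (how much the break actually grows is allocator-dependent):

#define _DEFAULT_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void) {
    void *before = sbrk(0);
    void *a = malloc(10);   /* first call: allocator likely grabs a big chunk */
    void *mid = sbrk(0);
    void *b = malloc(10);   /* second call: served from the same chunk        */
    void *after = sbrk(0);

    printf("break moved by first malloc:  %ld bytes\n",
           (long)((char *)mid - (char *)before));
    printf("break moved by second malloc: %ld bytes\n",
           (long)((char *)after - (char *)mid));
    free(a);
    free(b);
    return 0;
}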

The malloc() function uses the sbrk system call to allocate memory dynamically while the process runs.
malloc() is declared in the stdlib.h header file, so you simply call the library function and the library performs the underlying system calls for you.
With sbrk, by contrast, you have to request the memory from the kernel explicitly yourself.
In either case, a pointer to a block of the requested size is returned, which you store in a variable.

The sbrk() function increases the program's data segment allocation by the specified number of bytes.
p = malloc(4096); // sbrk += 4096 bytes
free(p);          // freeing the memory will not bring the break back down by 4096 bytes
p = malloc(4096); // malloc'ing again will not increase the break; it reuses the existing space, so no sbrk() call results

Related

the suggested way to use clEnqueueMapBuffer and clEnqueueUnmapMemObject when implementing zero copy

I am experimenting with deep learning in OpenCL; the output size of the tensor is fixed.
In CUDA, I can use zero copy via cudaMallocHost, which can be called during initialization, and I can then read the output of the tensor from the host without explicitly calling cudaMemcpy.
It's very efficient since it's called only once over the entire execution of my program; I don't need to call cudaMallocHost again after every forward pass.
When I try to implement zero copy in OpenCL, some implementations call clEnqueueMapBuffer and clEnqueueUnmapMemObject after every forward pass, whenever you want to read the output of the tensor.
Here is an example (https://github.com/alibaba/MNN/blob/master/source/backend/opencl/core/OpenCLBackend.cpp#L291).
But I find that the overhead of clEnqueueMapBuffer cannot be neglected; sometimes the latency is quite large.
Is this really the suggested way to do it? Can I call clEnqueueMapBuffer just once in the lifetime of my program and call clEnqueueUnmapMemObject once at the end of my program? Are there any issues with doing so?
If your OpenCL implementation supports Shared Virtual Memory (introduced in 2.0), that feature allows you to do something similar, and much more.
For OpenCL 1.x, unless your OpenCL implementation makes any guarantees above and beyond the standard (which I'd expect it to do via an extension), you must unmap a buffer before a kernel gets write access to it, and likewise, you must not allow a kernel to read from it while it is mapped for writing.
This is explained in the clEnqueueMapBuffer specification:
Reads and writes by a kernel executing on a device to a memory region(s) mapped for writing are undefined.
The behavior of writes by a kernel executing on a device to a mapped region of a memory object is undefined.
In version 1.2, this was expanded, but the gist is the same:
If a memory object is currently mapped for writing, the application must ensure that the memory object is unmapped before any enqueued kernels or commands that read from or write to this memory object or any of its associated memory objects (sub-buffer or 1D image buffer objects) or its parent object (if the memory object is a sub-buffer or 1D image buffer object) begin execution; otherwise the behavior is undefined.
If a memory object is currently mapped for reading, the application must ensure that the memory object is unmapped before any enqueued kernels or commands that write to this memory object or any of its associated memory objects (sub-buffer or 1D image buffer objects) or its parent object (if the memory object is a sub-buffer or 1D image buffer object) begin execution; otherwise the behavior is undefined.
If you find that map/unmap has a high overhead, you are probably not hitting a zero-copy code path in your OpenCL implementation, and the driver is actually copying the memory contents. If in doubt, check with your implementation vendor to see how they recommend you implement zero-copy buffers in OpenCL. Zero-copy buffers are not guaranteed by the standard.
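
For reference, the per-iteration pattern the specification requires looks roughly like this C sketch (queue, buffer and output_size are assumed to exist, the buffer having been created with CL_MEM_ALLOC_HOST_PTR to encourage a zero-copy path; error handling abbreviated):

cl_int err;
/* Map after the kernel producing the output has finished. */
float *host_view = (float *)clEnqueueMapBuffer(
    queue, buffer, CL_TRUE /* blocking */, CL_MAP_READ,
    0, output_size, 0, NULL, NULL, &err);
/* ... read the tensor output through host_view ... */
/* Unmap before enqueueing any kernel that writes the buffer again. */
err = clEnqueueUnmapMemObject(queue, buffer, host_view, 0, NULL, NULL);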

Concurrent write with OCaml Async

I'm reading data from the network and I'd like to write it to a file as it arrives. The writes are concurrent and non-sequential (think P2P file sharing). In C, I would get a file descriptor to the file (kept for the duration of the program), then use lseek followed by write, and eventually close the fd. These operations could be protected by a mutex in a multithreaded setting (in particular, the lseek/write pair should be atomic).
I don't really see how to get that behavior in Async. My initial idea is to have something like this:
let write fd s pos =
  let posl = Int64.of_int pos in
  Async_unix.Unix_syscalls.lseek fd ~mode:`Set posl
  >>| fun _ ->
  let wr = Writer.create fd in
  let len = String.length s in
  Writer.write wr s ~pos:0 ~len
Then, the writes are scheduled asynchronously when data is received.
My solution isn't correct. For one thing, this write task needs to be atomic, but it is not: two lseeks can be executed before the first Writer.write. Even if I could schedule the writes sequentially, it wouldn't help, since Writer.write doesn't return a Deferred.t. Any ideas?
BTW, this is a follow-up to a previous answered question.
The basic approach would be to have a queue of workers, where each worker performs an atomic seek/write [1] operation. The invariant is that only one worker at a time is running. A more complicated strategy would employ a priority queue, where writes are ordered by some criterion that maximizes throughput, e.g., writes to subsequent positions. You may also implement a sophisticated buffering strategy: if you observe lots of small writes, a good idea would be to coalesce them into larger chunks.
But let's start with a simple non-prioritized queue, implemented via Async.Pipe.t. For the positional write, we can't use the Writer interface, as it is designed for buffered sequential writes. So, we will use Unix.lseek from Async_unix.Std and the Bigstring.really_write function. really_write is a regular non-asynchronous function, so we need to lift it into the Async interface using the Fd.syscall_in_thread function, e.g.:
let really_pwrite fd offset bytes =
  Unix.lseek fd offset ~mode:`Set >>= fun (_ : int64) ->
  Fd.syscall_in_thread fd (fun desc ->
      Bigstring.really_write desc bytes)
Note: this function will write as many bytes as the system decides, but no more than the length of bytes. So you might be interested in implementing a really_pwrite function that writes all the bytes.
The overall scheme would include one master thread that owns the file descriptor and accepts write requests from multiple clients via an Async.Pipe. Suppose that each write request is a message of the following type:
type chunk = {
  offset : int;
  bytes : Bigstring.t;
}
Then your master thread will look like this:
let process_requests fd =
  Async.Pipe.iter ~f:(fun {offset; bytes} ->
      really_pwrite fd offset bytes)
where really_pwrite is a function that really writes all the bytes and handles all the errors. You may also use the Async.Pipe.iter' function to presort and coalesce the writes before actually executing the pwrite syscall.
One more optimization note: allocating a bigstring is a rather expensive operation, so you may consider preallocating one big bigstring and serving small chunks from it. This creates a limited resource, so your clients will wait until other clients finish their writes and release their chunks. As a result, you will have a throttled system with a limited memory footprint.
[1] Ideally we should use pwrite, but Janestreet provides only the pwrite_assume_fd_is_nonblocking function, which doesn't release the OCaml runtime when the call to the system pwrite is made and will actually block the whole system. So we need to use a combination of a seek and a write; the latter releases the OCaml runtime so that the rest of the program can continue. (Also, given their definition of a nonblocking fd, this function doesn't really make much sense, as only sockets and FIFOs are considered non-blocking, and as far as I know they do not support the seek operation. I will file an issue on their bug tracker.)

Is it possible to get device load in OpenCL

I know how to use clGetDeviceInfo to query information about the device but I don't know how to get information about the device at runtime. For example, how much global memory is in use right now? How busy have the processing elements been, on average, in the last n nanoseconds?
AFAIK, no. OpenCL itself does not have any API to query current status of a device. Those are exposed by the vendor of your particular implementation (like the GPUPerfAPI from AMD or the Graphics Performance analyzer from Intel).
Hope this helps.
What I did to be able to determine the free memory at runtime is write a wrapper around clDevice (or cl::Device in my case) and pipe all buffer allocations through said wrapper.
At the beginning of the program, I query the total device memory (CL_DEVICE_GLOBAL_MEM_SIZE), and when buffers are allocated I store their addresses and sizes in a vector, so I can subtract the accumulated size of the currently allocated buffers from the total memory.
With OpenCL, you can attach callbacks to buffers that are invoked when the buffer is destroyed (clSetMemObjectDestructorCallback), so I use those to clean up when a buffer is released. Hint: the cl_mem parameter the callback receives is NOT a valid mem object. It may already have been destroyed, so you cannot query it for its size (that took me a couple of hours, even though it's clearly stated in the standard ...).
This way, I always know how much memory is left on the device.
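
A minimal C sketch of this bookkeeping approach (names such as tracked_alloc are hypothetical, thread safety is ignored, and error handling is abbreviated):

#include <CL/cl.h>
#include <stdlib.h>

static cl_ulong total_mem = 0;  /* filled once from CL_DEVICE_GLOBAL_MEM_SIZE */
static cl_ulong used_mem  = 0;  /* accumulated size of live buffers           */

/* Destructor callback: do NOT query memobj here; it may already be
   destroyed, so the size travels in user_data instead. */
static void CL_CALLBACK on_destroy(cl_mem memobj, void *user_data) {
    used_mem -= *(size_t *)user_data;
    free(user_data);
}

/* Hypothetical wrapper: allocate a buffer and track its size. */
static cl_mem tracked_alloc(cl_context ctx, cl_mem_flags flags, size_t size) {
    cl_int err;
    cl_mem buf = clCreateBuffer(ctx, flags, size, NULL, &err);
    if (err != CL_SUCCESS) return NULL;
    size_t *sz = malloc(sizeof *sz);
    *sz = size;
    clSetMemObjectDestructorCallback(buf, on_destroy, sz);
    used_mem += size;
    return buf;
}

/* Estimated free device memory: total minus what we have allocated. */
static cl_ulong estimated_free_mem(void) { return total_mem - used_mem; }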

Proper way to inform OpenCL kernels of many memory objects?

In my OpenCL program, I am going to end up with 60+ global memory buffers that each kernel is going to need to be able to access. What's the recommended way to let each kernel know the location of each of these buffers?
The buffers themselves are stable throughout the life of the application -- that is, we will allocate the buffers at application start, call multiple kernels, then only deallocate the buffers at application end. Their contents, however, may change as the kernels read/write from them.
In CUDA, the way I did this was to create 60+ program scope global variables in my CUDA code. I would then, on the host, write the address of the device buffers I allocated into these global variables. Then kernels would simply use these global variables to find the buffer it needed to work with.
What would be the best way to do this in OpenCL? It seems that CL's global variables are a bit different from CUDA's, but I can't find a clear answer on whether my CUDA method will work, and if so, how to go about transferring the buffer pointers into global variables. If that won't work, what's the best way otherwise?
60 global variables sure is a lot! Are you sure there isn't a way to refactor your algorithm a bit to use smaller data chunks? Remember, each kernel should be a minimum work unit, not something colossal!
However, there is one possible solution. Assuming your 60 arrays are of known size, you could store them all into one big buffer, and then use offsets to access various parts of that large array. Here's a very simple example with three arrays:
A is 100 elements
B is 200 elements
C is 100 elements
big_array = A[0:100] B[0:200] C[0:100]
offsets = [0, 100, 300]
Then, you only need to pass big_array and offsets to your kernel, and you can access each array. For example:
A[50] = big_array[offsets[0] + 50]
B[20] = big_array[offsets[1] + 20]
C[0] = big_array[offsets[2] + 0]
I'm not sure how this would affect caching on your particular device, but my initial guess is "not well." This kind of array access is a little nasty, as well. I'm not sure if it would be valid, but you could start each of your kernels with some code that extracts each offset and adds it to a copy of the original pointer.
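In kernel code, that pointer-extraction prologue might look like this OpenCL C sketch (kernel and argument names are made up):

kernel void use_packed_arrays(global float *big_array,
                              global const int *offsets) {
    /* Recover the three logical arrays from the packed buffer. */
    global const float *A = big_array + offsets[0];
    global const float *B = big_array + offsets[1];
    global float *C = big_array + offsets[2];
    int i = get_global_id(0);
    C[i] = A[i] + B[i];   /* example use of the three arrays */
}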
On the host side, in order to keep your arrays more accessible, you can use clCreateSubBuffer: http://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/clCreateSubBuffer.html which would also allow you to pass references to specific arrays without the offsets array.
I don't think this solution will be any better than passing 60 kernel arguments, but depending on the cost of clSetKernelArg in your OpenCL implementation, it might be faster. It will certainly reduce the length of your argument list.
You need to do two things. First, each kernel that uses each global memory buffer should declare an argument for each one, something like this:
kernel void awesome_parallel_stuff(global float* buf1, ..., global float* buf60)
so that each utilized buffer for that kernel is listed. And then, on the host side, you need to create each buffer and use clSetKernelArg to attach a given memory buffer to a given kernel argument before calling clEnqueueNDRangeKernel to get the party started.
Note that if the kernels will keep using the same buffer with each kernel execution, you only need to setup the kernel arguments one time. A common mistake I see, which can bleed host-side performance, is to repeatedly call clSetKernelArg in situations where it is completely unnecessary.
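
A hedged host-side sketch of that one-time setup (ctx, queue, kernel, the buffer sizes, gws and num_iters are assumed to exist; only two of the 60 buffers are shown):

/* Create the buffers once, at application start. */
cl_mem buf1 = clCreateBuffer(ctx, CL_MEM_READ_WRITE, size1, NULL, &err);
cl_mem buf60 = clCreateBuffer(ctx, CL_MEM_READ_WRITE, size60, NULL, &err);

/* Bind each buffer to its argument slot once... */
clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf1);
/* ...arguments 1 through 58... */
clSetKernelArg(kernel, 59, sizeof(cl_mem), &buf60);

/* ...then launch repeatedly; the bindings persist across launches. */
for (int iter = 0; iter < num_iters; ++iter)
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &gws, NULL, 0, NULL, NULL);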

what happens when you invoke malloc() on a unix system

The malloc() library function internally calls the brk() or sbrk() system call, which allocates memory in the data region; so, as I understand it, local static variables and global variables have memory allocated from the heap, increasing the effective size of the data region. Now my question is: what exactly happens when I allocate memory to int *a, which is a local variable?
I might have a misconception; please let me know if so. Thanks.
int *p itself is a local variable, which is a pointer (these days: usually four or eight bytes, usually on the stack or in a register). When you do p = malloc(...), you are allocating memory (on the heap - or what is these days conventionally called 'the heap' even if a heap is not the structure used to manage free memory) and assigning a pointer to that memory into p.
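A minimal C sketch of that distinction (the printed addresses will vary from run to run):

#include <stdio.h>
#include <stdlib.h>

int main(void) {
    int *p;                          /* p itself lives on the stack        */
    p = malloc(100 * sizeof *p);     /* the block p points to: on the heap */
    if (p == NULL) return 1;
    printf("address of p itself: %p\n", (void *)&p);
    printf("address stored in p: %p\n", (void *)p);
    free(p);
    return 0;
}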
When you call malloc() you get access to the amount of memory requested, or NULL is returned. That is all that is guaranteed. Everything else is implementation dependent. The mechanism by which you get access to that memory can be quite varied.
