Are realpath() portability concerns obsolete?

The traditional way to call the Unix realpath() function has been realpath(pathname, buf), where buf is a user-supplied buffer with room for PATH_MAX bytes. This is problematic since PATH_MAX is unnecessarily big for most filenames and yet can be smaller than the actual OS pathname length limit.
The ability to pass a NULL pointer in place of buf was later added. In this case realpath() will dynamically allocate a buffer of the right size using malloc(). This makes the function easy to use safely. Since NULL support was a later addition, it was not universally implemented and hence portable programs could not rely on it.
POSIX Issue 7, 2018 edition now guarantees NULL support. Endorsement by POSIX would seem to imply that the portability concerns have all but vanished. Are there any Unix systems in active use (e.g. from the last decade) where realpath() does not support giving a NULL buffer?

realpath(path, NULL) works on recent releases of at least the following systems:
Darwin
DragonFly BSD
FreeBSD
Haiku
Linux/glibc
Linux/musl
Minix
NetBSD
OpenBSD
Solaris (OmniOS)

According to the Gnulib documentation, the Gnulib developers last saw this issue on
Mac OS X 10.5 (end of support in 2011),
FreeBSD 6.4 (end of support in 2010),
OpenBSD 4.4 (end of support in 2009),
Solaris 10 (end of support in 2024).
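For reference, a minimal C sketch of the calling pattern discussed above (a hypothetical example; the buffer returned for a NULL second argument is allocated with malloc() and must be freed by the caller):
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const char *path = (argc > 1) ? argv[1] : ".";
    char *resolved = realpath(path, NULL);  /* NULL asks realpath() to allocate the buffer */
    if (resolved == NULL) {
        perror("realpath");
        return 1;
    }
    printf("%s\n", resolved);
    free(resolved);
    return 0;
}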

RISC-V 32/64-bit compatibility issues

Suppose you take an RV32 program and try running it on a 64-bit system, what compatibility issues are likely to arise?
As I understand it, the instruction encoding is the same, and on RISC-V (like other modern RISC architectures, though unlike x86), ALU operations automatically operate on whatever the word size is, so if you add the contents of a pair of registers, you will get a 32-bit or 64-bit addition as applicable. Load and store of course work on an explicitly specified size because they depend on how many bytes have been allocated in memory.
One theoretically possible compatibility issue would arise if the code depends on bits past 32 being discarded, e.g. add 2^31 to itself and compare the result with zero.
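To make that concrete, here is a hypothetical C sketch (under the usual ABIs, unsigned long is 32 bits on RV32/ILP32 and 64 bits on RV64/LP64, so the two platforms take different branches):
#include <stdio.h>

int main(void)
{
    unsigned long x = 0x80000000UL;            /* 2^31 */
    /* On RV32 (ILP32) x + x wraps to 0; on RV64 (LP64) it is 2^32. */
    if (x + x == 0)
        printf("32-bit wrap-around observed\n");
    else
        printf("no wrap-around: x + x = %lu\n", x + x);
    return 0;
}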
Another more practical issue would arise if the operating system supplies memory addresses outside the first 4 gigabytes, which would be garbled when the code stores the addresses in 32-bit variables.
Are there any other issues I am missing?
You are correct about both of those possible compatibility issues.
Additionally, some Control and Status Registers (namely cycleh, instreth, timeh) are not needed in RV64I and therefore don't exist. Any code that tries to access them will trap with an illegal-instruction exception.
However, RV64I provides ALU instructions that operate on only the lower 32 bits (the *W variants), so the relevant instructions in a binary could in principle be rewritten by patching the opcode and funct3 fields.
So with an operating system mode which returns only 32-bit addresses, it would be possible to replace the binary with a working 64-bit version, so long as cycleh and friends aren't used.
References (RISC-V Specification v2.2):
Chapter 4 outlines the differences between RV32I and RV64I.
Chapter 2.8 goes over Control and Status Registers.
Table 19.3 lists all of the CSRs in the standard.

How does Open MPI implement datatype conversion?

The MPI standard states that when parallel programs run in a heterogeneous environment, the processes may use different representations for the same datatype (for example, big-endian versus little-endian machines for integers), so datatype representation conversion may be needed for point-to-point communication. I don't know how Open MPI implements this.
For instance, current Open MPI uses the UCX library by default. I have studied some code of the UCX library and of Open MPI's UCX module. However, for a contiguous datatype like MPI_INT, I didn't find any representation conversion happening. I wonder: did I miss that part, or does the implementation not satisfy the standard?
If you want to run an Open MPI app on a heterogeneous cluster, you have to configure with --enable-heterogeneous (this is disabled by default). Keep in mind this is supposed to work, but it is lightly tested, mainly because of a lack of interest/real use cases. FWIW, IBM POWER is now little endian, and Fujitsu is moving from SPARC to ARM for HPC, so virtually all HPC processors are (or will soon be) little endian.
Open MPI uses convertors (see opal/datatype/opal_convertor.h) to pack the data before sending it, and unpack it once received.
The data is packed in its current endianness. Data conversion (e.g. swap bytes) is performed by the receiver if the sender has a different endianness.
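Conceptually, the receiver-side fix-up for a 32-bit integer amounts to a byte swap of each element, along the lines of this simplified sketch (illustrative only, not Open MPI's actual convertor code; swap_int32_buffer is a made-up helper name):
#include <stddef.h>
#include <stdint.h>

/* Swap the byte order of each 32-bit integer in place, roughly what the
 * receiver has to do when the sender's endianness differs from its own. */
static void swap_int32_buffer(uint32_t *buf, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        uint32_t v = buf[i];
        buf[i] = (v >> 24) | ((v >> 8) & 0x0000FF00u)
               | ((v << 8) & 0x00FF0000u) | (v << 24);
    }
}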
There are two ways of using UCX: pml/ucx and pml/ob1 + btl/ucx, and I have tested neither of them in a heterogeneous environment. If you are facing issues with pml/ucx, try mpirun --mca pml ob1 ....

OpenCL "cross"-compile x64 / 32-bit-pointer GPU

I'm trying to optimize my kernel functions and ran into a bit of an issue. First, this may be Radeon R9 (Hawaii) related, but it should happen for other GPU devices as well.
For the host I have two platform options: either compile and run as an x86 program, or as an x64 program. Depending on which platform I choose, I get a different compiled kernel: one that uses 32-bit pointers and pointer arithmetic, and one that uses 64-bit pointers. The generated IL code shows the difference; in the first case it is
prog kernel &__OpenCL_execute_kernel(
kernarg_u32 %_.global_offset_0,
kernarg_u32 %_.global_offset_1,
...
and in the second case it is:
prog kernel &__OpenCL_execute_kernel(
kernarg_u64 %_.global_offset_0,
kernarg_u64 %_.global_offset_1,
...
64-bit arithmetic on a GPU is rather expensive and consumes a lot of additional VGPRs. In my case, the 64-bit pointer version requires 8 more VGPRs and has about 140 more VALUInsts, as shown by CodeXL. Overall performance is about 37% worse in my case between the slower 64-bit and the faster 32-bit kernel code, which is, other than the internal pointer arithmetic, completely identical. I have tried to optimize this, but even with plain offsets I'm still stuck with a lot of ADD_U64 IL-instructions, which in ISA code produce two instructions each: V_ADD_I32 and V_ADDC_U32. And of course all pointers require twice the private memory space (hence more VGPRs).
Now my question is: is there a way to "cross"-compile an OpenCL kernel so an x64 program can create a 32-bit-pointer kernel? I don't need to address that much memory on the GPU, so addressing less than 4 GiB of memory space is fine. As my host also executes AVX-512 instructions with all 32 zmm registers, which are only available in x64 mode, an x86 program is not an option. That makes the whole situation a bit challenging.
Well, my fallback solution is to spawn an x86 child process that uses shared memory and acts as a compiling gate. But I'd rather not do that if a simple flag or (AMD-specific) setting in OpenCL does the trick.
Please don't reply with a why-that-is response. I'm completely aware of why the x64 program and kernel behave that way.
I have a couple of ideas, but not being familiar with the guts of the AMD GPU OpenCL implementation, I am stabbing in the dark.
Can you pass the data in via an image (even if it's not)? On Intel GPUs going through the sampler provides a different path and can avoid 64-bit arithmetic even in the 64-bit version.
Does AMD have an extension that allows you to block read and write? This can help if the compiler proves that the address is uniform (scalar). E.g. something like Intel Subgroups (which enable some block IO). On Intel this helps avoid shipping a SIMD's worth of addresses across the bus for a scatter/gather (and saves register space too).
(This is a stretch.) Does compiling for OpenCL 1.2 or lower help? That is, specify -cl-std=CL1.2? If the compiler knows that SVM is not being used (>=OpenCL 2.0) and were to run a conservative analysis on the program to prove that it's not doing something wild with pointer arithmetic, it could feasibly do arithmetic in 32-bit and implicitly add a 64-bit relative offset to all addresses (making the GPU program think that it's using 32-bit addresses).
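For what it's worth, here is a sketch of how that build option would be passed on the host side (standard OpenCL API calls; build_with_cl12 is a made-up helper, and whether the AMD compiler actually narrows pointer arithmetic under this flag is untested):
#include <CL/cl.h>

/* Build the program for one device with an explicit -cl-std flag. */
static cl_int build_with_cl12(cl_program program, cl_device_id device)
{
    cl_int err = clBuildProgram(program, 1, &device, "-cl-std=CL1.2", NULL, NULL);
    if (err != CL_SUCCESS) {
        size_t log_size = 0;
        clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG, 0, NULL, &log_size);
        /* allocate log_size bytes and call clGetProgramBuildInfo again to read the log */
    }
    return err;
}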
Again, I know nothing about AMD specifics, but I feel your pain with this problem.

IO Completion Ports for Mac OS X

Is there any equivalent of IO completion ports on Mac OS X for implementing asynchronous IO on files?
Thank you....
Unfortunately, no.
kqueue is the mechanism for high-performance asynchronous I/O on OS X and FreeBSD. Like Linux epoll, it signals at the opposite end of the I/O operation compared to IOCPs (Solaris, AIX, Windows): kqueue and epoll signal when it's OK to attempt a read or a write, whereas IOCPs call back when a read or a write has completed. Many find the signalling mechanism used by epoll and kqueue difficult to understand compared to the IOCP model. So while kqueue and IOCP are both mechanisms for high-performance asynchronous I/O, they are not directly comparable.
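To illustrate the readiness model, here is a minimal kqueue sketch in C (error handling omitted; wait_until_readable is a made-up helper and sockfd is assumed to be an open socket). It only tells you a read can be attempted, which is exactly where an IOCP design would instead hand you an already-completed read:
#include <sys/types.h>
#include <sys/event.h>
#include <sys/time.h>
#include <unistd.h>

void wait_until_readable(int sockfd)
{
    int kq = kqueue();
    struct kevent change, event;

    EV_SET(&change, sockfd, EVFILT_READ, EV_ADD, 0, 0, NULL);
    kevent(kq, &change, 1, NULL, 0, NULL);   /* register interest */
    kevent(kq, NULL, 0, &event, 1, NULL);    /* block until a read can be attempted */
    /* now read(sockfd, ...) is expected not to block */
    close(kq);
}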
It is possible to implement IOCPs using epoll or kqueue and a thread pool. You can find an example of that in the Wine project.
Correction:
Mac OS X has an implementation of IOCP-like functions in Grand Central Dispatch. It uses the GCD thread pool and the kqueue APIs internally. The convenience functions are dispatch_read and dispatch_write. Like IOCP, the asynchronous I/O functions in GCD signal at the completion of an I/O task, not when the file descriptor is ready as with the raw kqueue API.
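A sketch of that completion-style API (C with clang blocks; read_async is a made-up wrapper and fd is assumed to be an open file descriptor):
#include <dispatch/dispatch.h>
#include <stdio.h>

/* Read up to 4096 bytes and get called back on completion, IOCP-style,
 * rather than being woken when the descriptor becomes readable. */
void read_async(dispatch_fd_t fd)
{
    dispatch_queue_t queue = dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0);
    dispatch_read(fd, 4096, queue, ^(dispatch_data_t data, int error) {
        if (error == 0)
            printf("read %zu bytes\n", dispatch_data_get_size(data));
    });
}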
Beware that GCD APIs are not "fork safe", and cannot be used on both sides of a POSIX fork without an exec. If you do, the function call will never return.
Also beware that kqueue in Mac OS X is rumored to be less performant than kqueue in FreeBSD, so it might be better for development than production. GCD (libdispatch) is Open Source however, and can be used on other platforms as well.
Update Jan 3, 2015:
FreeBSD has GCD from version 8.1. Wine has epoll-based IOCP for Linux. It is therefore possible to use an IOCP design to write server code that should run on Windows, Linux, Solaris, AIX, FreeBSD, Mac OS X (and iOS, but not Android). This is different from using kqueue and epoll directly, where a Windows server must be restructured to use its IOCPs, and will very likely be less performant.
Since you asked for a Windows-specific feature on OS X: instead of using kqueue directly, you may try libevent. It's a thin wrapper around different AIO mechanisms and it works on both platforms.
Use Kqueue
http://en.wikipedia.org/wiki/Kqueue

What are the limitations of kqueue?

The documentation for libev (source) says that:
Kqueue deserves special mention, as at the time of this writing, it was broken on all BSDs except NetBSD (usually it doesn't work reliably with anything but sockets and pipes, except on Darwin, where of course it's completely useless).
It also mentions that:
The kqueue syscall is broken in all known versions - most versions support only sockets, many support pipes.
So, what are the limitations of kqueue? Where are these limitations documented? Initial research turned up references to kernel panics on older operating systems (Mac OS X 10.3) and complaints about incorrect/incomplete documentation. I don't know how reliable these sources are.
In particular, if kqueue does work reliably with sockets (AF_UNIX, AF_INET, and AF_INET6) then I don't mind. I am particularly interested in information about the Mac OS X and FreeBSD implementations.
On OS X, you shouldn't have problems with AF_UNIX, AF_INET, and AF_INET6. You will have problems if you want to use it with a PTY on OS X < 10.9, as PTYs are unsupported on those versions. There is some evidence that on OS X 10.9, PTYs are finally supported.
If you try to use the non-file-descriptor notifications you will start to run into other limitations (e.g. AIO is unsupported).
I'm not familiar with FreeBSD's kqueue implementation. Perhaps someone else who is can add some information about it.
kqueue works perfectly on FreeBSD, at least for networking. I have tested networking myself with up to 180k connected, active sockets. I don't know about AIO; I haven't tested it myself.
