Suppose you take an RV32 program and try to run it on a 64-bit system. What compatibility issues are likely to arise?
As I understand it, the instruction encoding is the same, and on RISC-V (like other modern RISC architectures, though unlike x86), ALU operations automatically operate at whatever the word size is: if you add the contents of a pair of registers, you get a 32-bit or 64-bit addition as applicable. Loads and stores, of course, use an explicitly specified size, because they depend on how many bytes have been allocated in memory.
One theoretically possible compatibility issue would arise if the code depends on bits beyond the low 32 being discarded, e.g. adding 2^31 to itself and comparing the result with zero.
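A minimal C sketch of that dependence (assuming the usual ILP32 ABI for RV32, where unsigned long is 32 bits; the function and comparison here are purely illustrative):

/* Compiled as RV32, the add below wraps to 0 in a 32-bit register, so the
   function returns 1.  Execute the same add on an RV64 core and the result
   is 2^32, so the comparison fails and the function returns 0. */
int wraps_to_zero(void)
{
    unsigned long x = 0x80000000UL;   /* 2^31 */
    return (x + x) == 0;
}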
Another, more practical, issue would arise if the operating system supplies memory addresses outside the first 4 gigabytes, which would be truncated when the code stores them in 32-bit variables.
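A sketch of that address-truncation hazard, assuming a hypothetical allocation that lands above the first 4 GiB:

#include <stdint.h>
#include <stdlib.h>

/* RV32 code naturally keeps pointers in 32-bit slots.  If the OS returns an
   address above 4 GiB, storing it in a 32-bit variable drops the upper bits
   and the round-tripped pointer no longer matches the original. */
int pointer_round_trips(void)
{
    char *p = malloc(64);
    uint32_t stored = (uint32_t)(uintptr_t)p;   /* lossless only below 4 GiB */
    char *q = (char *)(uintptr_t)stored;
    int ok = (q == p);
    free(p);
    return ok;
}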
Are there any other issues I am missing?
You are correct about both of those possible compatibility issues.
Additionally, some control and status registers (namely cycleh, instreth, and timeh) are not needed in RV64I and therefore don't exist there. Any code that tries to access them will raise an illegal-instruction exception.
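To see why those CSRs matter, here is the standard RV32 idiom for reading the 64-bit cycle counter in two halves (a GCC inline-assembly sketch); on RV64 there is no cycleh behind the rdcycleh reads, and native 64-bit code would use a single rdcycle instead:

#include <stdint.h>

/* RV32 only: read cycle/cycleh, re-reading the high half to guard against a
   carry between the two reads.  On RV64 the rdcycleh lines would trap. */
static uint64_t rdcycle64_rv32(void)
{
    uint32_t lo, hi, hi2;
    do {
        __asm__ volatile ("rdcycleh %0" : "=r"(hi));
        __asm__ volatile ("rdcycle  %0" : "=r"(lo));
        __asm__ volatile ("rdcycleh %0" : "=r"(hi2));
    } while (hi != hi2);
    return ((uint64_t)hi << 32) | lo;
}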
However, RV64I does provide instructions that perform ALU operations on only the lower 32 bits (the *W variants), which an RV32 binary could in principle be converted to use by rewriting the opcode field in the encoding (the funct3 field is the same for the W variants).
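As a sketch of that patching idea (encodings taken from the spec: the OP opcode is 0b0110011 and OP-32 is 0b0111011; for ADD to ADDW the funct3/funct7 fields are unchanged):

#include <stdint.h>

/* Rewrite an R-type OP instruction (ADD, SUB, ...) into its OP-32 (W) form. */
static uint32_t op_to_op32(uint32_t insn)
{
    if ((insn & 0x7f) == 0x33)              /* opcode bits [6:0] == OP          */
        insn = (insn & ~0x7fu) | 0x3b;      /* switch to OP-32 (ADDW, SUBW, ...) */
    return insn;
}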
So, given an operating-system mode that hands out only 32-bit addresses, it would be possible to rewrite such a binary into a working 64-bit version, as long as cycleh and friends aren't used.
References (RISC-V Specification v2.2):
Chapter 4 outlines the differences between RV32I and RV64I.
Section 2.8 covers control and status registers.
Table 19.3 lists all of the CSRs in the standard.
When using SSE instructions/intrinsics, say for 256-bit registers, has anyone been able to reduce time spent loading the extended registers from memory by using either the prefetch instruction on the next 32-byte chunk, or by some other technique? Assume the data to be loaded is already properly aligned in memory.
See the x86 tag wiki for more info about x86 CPU performance. Hardware prefetchers are pretty good at locking onto patterns of sequential access, so you don't usually need software prefetch instructions.
Usually it's not a win to do a wide vector load and unpack it into separate integer registers. Once you've touched a cache line, more loads from it are cheap, and throughput from L1 cache into registers isn't usually the problem. Using ALU instructions to unpack a 256b load into separate 32b or 64b integers just takes more instructions and means you're more likely to bottleneck on ALU throughput.
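For completeness, this is roughly what the software-prefetch version looks like (a hedged sketch: the prefetch distance of 64 floats ahead is a guess that would need tuning, and as noted above the hardware prefetcher usually makes the _mm_prefetch redundant for a plain sequential sweep):

#include <immintrin.h>
#include <stddef.h>

/* Sum a 32-byte-aligned float array (n a multiple of 8) with 256-bit loads,
   issuing a software prefetch a fixed distance ahead of the current load. */
float sum256(const float *p, size_t n)
{
    __m256 acc = _mm256_setzero_ps();
    for (size_t i = 0; i < n; i += 8) {
        _mm_prefetch((const char *)(p + i + 64), _MM_HINT_T0);
        acc = _mm256_add_ps(acc, _mm256_load_ps(p + i));
    }
    float tmp[8];
    _mm256_storeu_ps(tmp, acc);
    return tmp[0] + tmp[1] + tmp[2] + tmp[3] + tmp[4] + tmp[5] + tmp[6] + tmp[7];
}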
I'm trying to optimize my kernel functions and ran into a bit of an issue. First, this may be Radeon R9 (Hawaii) related, but it should happen for other GPU devices as well.
For the host I have two platform options: either compile and run as an x86 program, or as an x64 program. Depending on which platform I choose, I get a different compiled kernel, one that uses 32-bit pointers and pointer arithmetic and one that uses 64-bit pointers. The generated IL code shows the difference; in the first case it is
prog kernel &__OpenCL_execute_kernel(
kernarg_u32 %_.global_offset_0,
kernarg_u32 %_.global_offset_1,
...
and in the second case it is:
prog kernel &__OpenCL_execute_kernel(
kernarg_u64 %_.global_offset_0,
kernarg_u64 %_.global_offset_1,
...
64-bit arithmetic on a GPU is rather expensive and consumes a lot of additional VGPRs. In my case, the 64-bit pointer version requires 8 more VGPRs and has about 140 more VALUInsts, as shown by CodeXL. Overall, the slower 64-bit kernel performs about 37% worse than the faster 32-bit kernel, even though the two are, apart from the internal pointer arithmetic, completely identical. I have tried to optimize this, but even with plain offsets I'm still stuck with a lot of ADD_U64 IL instructions, which in ISA code produce two instructions each: V_ADD_I32 and V_ADDC_U32. And of course all pointers require double the private memory space (hence more VGPRs).
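For reference, each of those ADD_U64 operations is computing the following, sketched in C; the two lines map onto the V_ADD_I32 / V_ADDC_U32 pair:

#include <stdint.h>

/* One 64-bit address add expressed as the 32-bit pair the ISA actually emits. */
static uint64_t add_u64(uint32_t a_lo, uint32_t a_hi, uint32_t b_lo, uint32_t b_hi)
{
    uint32_t lo = a_lo + b_lo;                  /* V_ADD_I32  (produces carry)  */
    uint32_t hi = a_hi + b_hi + (lo < a_lo);    /* V_ADDC_U32 (consumes carry)  */
    return ((uint64_t)hi << 32) | lo;
}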
Now my question is: is there a way to "cross"-compile an OpenCL kernel so that an x64 program can create a 32-bit-pointer kernel? I don't need to address that much memory on the GPU, so addressing less than 4 GiB of memory space is fine. As my host also executes AVX-512 instructions using all 32 zmm registers, which are only available in x64 mode, an x86 program is not an option. That makes the whole situation a bit challenging.
Well, my fallback solution is to spawn an x86 child process that uses shared memory and acts as a compilation gateway. But I'd rather not do that if a simple flag or (AMD-specific) setting in OpenCL does the trick.
Please don't reply with an explanation of why it is that way; I'm completely aware of why the x64 program and kernel behave as they do.
I have a couple of ideas, but since I'm not familiar with the guts of the AMD GPU OpenCL implementation, I'm stabbing in the dark.
Can you pass the data in via an image (even if it's not)? On Intel GPUs going through the sampler provides a different path and can avoid 64-bit arithmetic even in the 64-bit version.
Does AMD have an extension that allows you to do block reads and writes? This can help if the compiler can prove that the address is uniform (scalar), e.g. something like Intel Subgroups (which enables some block I/O). On Intel this helps avoid shipping a SIMD's worth of addresses across the bus for a scatter/gather (and saves register space too).
(This is a stretch.) Does compiling for OpenCL 1.2 or lower help? That is, specify -cl-std=CL1.2. If the compiler knows that SVM is not being used (that requires OpenCL 2.0 or later) and runs a conservative analysis on the program to prove that it's not doing anything wild with pointer arithmetic, it could feasibly do the arithmetic in 32 bits and implicitly add a 64-bit base offset to all addresses (making the GPU program think that it's using 32-bit addresses).
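On the host side that third idea is just a build-option change, something like the sketch below; whether AMD's compiler actually narrows its pointer arithmetic in response is exactly the open question:

#include <CL/cl.h>

/* Request the OpenCL 1.2 language standard (no SVM) when building the program. */
cl_int build_as_cl12(cl_program prog, cl_device_id dev)
{
    return clBuildProgram(prog, 1, &dev, "-cl-std=CL1.2", NULL, NULL);
}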
Again, I know nothing about AMD specifics, but I feel your pain with this problem.
I am reading the Intel virtualization manual, which says that if bit 6 of EPTP (a VM-execution control field) is set, the processor will set the accessed and dirty bits in the relevant EPT entries according to certain rules.
I am trying to understand how the guest operating system benefits from the processor setting the A/D bits in the EPT on access and modification of the relevant pages, given that the guest OS has no access to the EPT. In my understanding, the A/D bits are used by the memory manager of the OS for optimization and swapping algorithms, and these bits play no role in the page walk itself.
Do I (as the programmer of the VMM) have to add code in the VMM to look up the relevant entry in the GPA space and mark the bits accordingly?
If that is the case, then how can we say these bits are set without the knowledge of the VMM?
An explanation of how KVM deals with this would also make a good answer.
In general, the guest OS would not benefit from the accessed and dirty bits in the EPT being set. As you stated, the guest does not typically have access to the EPT; this is purely for the hypervisor/VMM. It is analogous to the dirty and accessed bits in a process page table: the process does not use them, only the OS does.
With regard to your second question, it is a bit unclear so I'm not sure what you are asking. However, the hardware will mark the accessed and dirty bits (assuming it has been set up correctly); you do not have to do it manually.
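To make the mechanics concrete, here is roughly all the VMM has to do, using the bit positions from the SDM (bit 6 of EPTP enables the feature; the CPU then sets bit 8, accessed, and bit 9, dirty, in the EPT entries on its own):

#include <stdint.h>

#define EPTP_ENABLE_AD_BITS  (1ULL << 6)   /* opt-in: hardware maintains EPT A/D bits    */
#define EPT_ENTRY_ACCESSED   (1ULL << 8)   /* set by the CPU on any access via the entry */
#define EPT_ENTRY_DIRTY      (1ULL << 9)   /* set by the CPU on writes (leaf entries)    */

/* The VMM only enables the feature; it never walks the EPT to set the bits itself. */
static uint64_t eptp_with_ad_tracking(uint64_t eptp)
{
    return eptp | EPTP_ENABLE_AD_BITS;
}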
I have a program that uses OpenCL for calculation. The OpenCL code is big and compiling it takes about 2 minutes at 100% CPU load, so of course I save the binary result of the compilation, and on the second launch I load the OpenCL program from the binary. Can I use the same binary on another video card with the same chip but different characteristics (RAM, clock, etc.)?
As far as the OpenCL specification is concerned, you only have guarantees that a program binary can be re-used on the same device on which it was created.
In reality, the binaries that are returned by many OpenCL implementations are compatible with a wider range of devices available from that same vendor. For example, NVIDIA return PTX when you request binaries from their implementation, which is a reasonably high level intermediate representation (i.e. not native instructions). This is certainly compatible with other devices using the same architecture on which it was created (e.g. all GK110 devices, or all GF104 devices), and quite likely to be portable across a range of other NVIDIA GPU architectures too. Other vendors also return various types of intermediate representation (usually LLVM IR based) that allow this kind of binary compatibility.
So yes, you can probably re-use binaries across different devices that have the same architecture, but you'll really just have to try it and see. You could always implement a scheme that tries to use the binary and, if that fails, falls back to the source code.
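A minimal sketch of that try-the-binary-then-fall-back scheme (error handling abbreviated; src, bin and bin_len are whatever your caching layer provides):

#include <CL/cl.h>
#include <stddef.h>

cl_program load_program(cl_context ctx, cl_device_id dev, const char *src,
                        const unsigned char *bin, size_t bin_len)
{
    cl_int err, bin_status;
    cl_program prog;

    if (bin != NULL) {
        prog = clCreateProgramWithBinary(ctx, 1, &dev, &bin_len, &bin,
                                         &bin_status, &err);
        if (err == CL_SUCCESS &&
            clBuildProgram(prog, 1, &dev, NULL, NULL, NULL) == CL_SUCCESS)
            return prog;                 /* cached binary worked on this device */
        if (err == CL_SUCCESS)
            clReleaseProgram(prog);
    }
    /* Fall back to compiling the OpenCL C source. */
    prog = clCreateProgramWithSource(ctx, 1, &src, NULL, &err);
    if (err == CL_SUCCESS)
        clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);
    return prog;
}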
In the future, we will hopefully see a large number of vendors supporting the recently ratified SPIR specification, which is a platform-portable intermediate representation for OpenCL device programs. This would allow you to generate binaries that are not only compatible with devices from a single vendor's architecture, but also across devices from many other vendors that also support SPIR. There would clearly be some remaining compilation overhead to lower SPIR to the native instruction set, but this should still result in significant speed-ups compared to compiling raw OpenCL C code.
I will try to explain my problem. There are 365 (global map) files in each of two directories, dir1 and dir2, which have the same format, byte type, extent, etc. I computed the bias between the two datasets using the function and code given below:
How can I solve this problem, please?
I suspect this is due to memory limitations on a 32-bit system. You want to allocate an array of 933M doubles, which requires 7.6 GB of contiguous memory. I suggest you read ?Memory and ?"Memory-limits" for more details. In particular, the latter says:
Error messages beginning ‘cannot allocate vector of size’ indicate a failure to obtain memory, either because the size exceeded the address-space limit for a process or, more likely, because the system was unable to provide the memory. Note that on a 32-bit build there may well be enough free memory available, but not a large enough contiguous block of address space into which to map it.
If this is indeed your problem, you may look into the bigmemory package (http://cran.r-project.org/web/packages/bigmemory/index.html), which allows you to manage massive matrices with shared and file-based memory. There are also other strategies (e.g. using an SQLite database) to manage data that doesn't fit in memory all at once.
Update: here is an excerpt from "Memory-limits" for Windows:
The address-space limit is 2Gb under 32-bit Windows unless the OS's default has been changed to allow more (up to 3Gb). See http://www.microsoft.com/whdc/system/platform/server/PAE/PAEmem.mspx and http://msdn.microsoft.com/en-us/library/bb613473(VS.85).aspx. Under most 64-bit versions of Windows the limit for a 32-bit build of R is 4Gb: for the oldest ones it is 2Gb. The limit for a 64-bit build of R (imposed by the OS) is 8Tb.
It is not normally possible to allocate as much as 2Gb to a single vector in a 32-bit build of R even on 64-bit Windows because of preallocations by Windows in the middle of the address space.
Under Windows, R imposes limits on the total memory allocation available to a single session as the OS provides no way to do so: see memory.size and memory.limit.