can we read and program the microcodes of AMD processor? - encryption

we can know that microcodes in Intel processors is encrypted (as issued in "Intel® 64 and IA-32 Architectures Software Developer’s Manual"). One cannot programm the Intel microcodes as he wants.
So, does anyone know how about the AMD microcodes? Are the microcodes of AMD CPU encrypted ?
Anyone knows how to program microcodes? It's doesn't limit on AMD or Intel CPUs.
Thank you in advance!
(ps: Not the microcodes in GPU, but in CPU).

This article provides information on the microcode of AMD's Opteron (K8) family. It claims that it is not encrypted and provides information on the microcode format and updating the microcode.

Anyone knows how to program microcodes? It's doesn't limit on AMD or Intel CPUs.
Not too many people do that kind of work. It's often written with a C compiler tweaked to generate the necessary microcode.
To answer your question in regard "is there other processors accepting microcode?" FPGA's are only programmed using such. These are not CPUs, what you program in them "is written at the hardware level". The microcode changes the doors and the result is your program. It can become very tedious as everything runs in parallel (true hardware parallelism).

AMD microcode for recent processors is, indeed, encrypted and authenticated, much like Intel's. You need to have the proper crypto key to sign a microcode update the processor will accept.
Intel does it by embedding in the processor mask (hardware read-only) microcode a hash of the valid key(s?): the key itself is too large to bother embedding in the processor, so it will be present in the update data itself as seen here. Also, the Intel microcode update is actually an unified processor-package update data, it updates more than just the microcode for the decode unit. It can update all sort of internal processor parameters, as well as control sequences for other units than the decoder... it also has both opcode (and likely microcode) that the processor runs before(?)/after applying the update.

Related

OpenCL: Writing to pointer in main memory

Is it possible, using OpenCL's DMA capabilities, to write to a main memory address that is passed into the cl program? I understand doing so would likely break the program, but the intent here is to run a GPU process and then overwrite the address space of the CPU program used to run it, so breakage is expected.
Thanks!
Which version of the OpenCL API are you targeting?
In OpenCL 2.0 and above you can use Shared Virtual Memory (SVM) to share address between host and device(s) in platforms that support it.
You can get more information about it in the Intel OpenCL SVM overview.
If you are using previous versions, or your hardware does not support it, you can use pinned memory with the appropriate flags to clCreateBuffer. In particular, CL_MEM_USE_HOST_PTR or CL_MEM_ALLOC_HOST_PTR, see clCreateBuffer in Khronos.
Note that, when using CL_MEM_USE_HOST_PTR has some alignment restrictions.
In general, in OpenCL, when and how the DMA is used depends on the hardware platform, so you should refer to the vendor documentation for details.

AES Encrypting for ESP8266 implemented on Software or Hardware? How to implement?

I need to write a basic encryption program for ESP8266. I did read the datasheet (https://www.espressif.com/sites/default/files/documentation/0a-esp8266ex_datasheet_en.pdf), and them says that existis the methods of encrypt: WEP/TKIP/AES. My main question is: The AES method, is implemented on software or hardware? This module is very simple, (36KB RAM, 90MHz CPU clock), so the algorithm is heavy to process. If AES is implemented in hardware, I think this task gets simpler, but I don't know how to use this. I did read at web, and the examples uses a #include "AES.h" lib, I don't know if this is implemented on hardware or software. The site of ESP8266 don't reply this question. So, I wants know about this and how, or where I found help, to implement this.
Ps.: I don't want use Arduino.
Also, I've already use this, https://github.com/CHERTS/esp8266-devkit/tree/master/Espressif/examples/ESP8266. But, for little jobs.
It's a software implementation. The RTOS SDK contains two implementations of AES, one of them shared with the basic SDK - all in software:
https://github.com/CHERTS/esp8266-devkit/blob/master/Espressif/ESP8266_RTOS_SDK/third_party/mbedtls/library/aes.c
https://github.com/CHERTS/esp8266-devkit/blob/master/Espressif/ESP8266_RTOS_SDK/third_party/ssl/crypto/ssl_aes.c
https://github.com/CHERTS/esp8266-devkit/blob/master/Espressif/ESP8266_SDK/third_party/mbedtls/library/aes.c
In addition, there's an implementation optimized for the AES-NI instruction set: https://github.com/CHERTS/esp8266-devkit/blob/master/Espressif/ESP8266_RTOS_SDK/third_party/mbedtls/library/aesni.c
However, AES-NI is only implemented by certain Intel and AMD CPUs. So it will not be compiled.
There are no signs of a hardware implementation.

Checking if gpu is integrated or not

I couldn't find any query command about device being integrated/embedded in cpu or using system ram or its own dedicated gddr memory? I can benchmark mapping/unmapping versus reading/writing to get a conclusion but that device can be under load at that time and behave wrong and it would add complexity to already complex load balancing algorithm that I'm using.
Is there a simple way to check if a gpu is using same memory with cpu so I can choose directly mapping/unmapping instead of reading/writing?
Edit: there is CL_DEVICE_LOCAL_MEM_TYPE
CL_GLOBAL or CL_LOCAL
is this an indication of integratedness?
OpenCL 1.x has the device query CL_DEVICE_HOST_UNIFIED_MEMORY:
Is CL_TRUE if the device and the host have a unified memory subsystem
and is CL_FALSE otherwise.
This query is deprecated as of OpenCL 2.0, but should probably still work on OpenCL 2.x platforms for now. Otherwise, you may be able to produce a heuristic from the result of CL_DEVICE_SVM_CAPABILITIES instead.

OpenCL "cross"-compile x64 / 32-bit-pointer GPU

I'm trying to optimize my kernel functions and ran into a bit of an issue. First, this may be Radeon R9 (Hawaii) related, but it should happen for other GPU devices as well.
For the host I have two platform options. Either compile and run as an x86-program, or run as an x64-program. Depending which platform I chose, I get different compiled kernels. One that uses 32-bit pointers and pointer arithmetic, and the other that uses 64-bit pointers. The generated IL code shows the difference, in the first case it is
prog kernel &__OpenCL_execute_kernel(
kernarg_u32 %_.global_offset_0,
kernarg_u32 %_.global_offset_1,
...
and in the second case it is:
prog kernel &__OpenCL_execute_kernel(
kernarg_u64 %_.global_offset_0,
kernarg_u64 %_.global_offset_1,
...
64-bit arithmetic on a GPU is rather expensive and consumes a lot of additional VGPRs. In my case, the 64-bit pointer version requires 8 VGPRs more and has about 140 VALUInsts more, as shown by CodeXL. Performance overall is about 37% worse in my case between the slower 64-bit and the faster 32-bit kernel code. Which is, other than internal pointer arithmetic, completely identical. I have tried to optimize this, but even with plain offsets I'm still stuck with a lot of ADD_U64 IL-instructions, which in ISA-code produce two instructions: V_ADD_I32 and V_ADDC_U32. And of course all pointers require double private memory space (hence more VGPRs).
Now my question is: Is there a way to "cross"-compile an OpenCL kernel so a x64-program can create a 32-bit-pointer kernel? I don't need to address that much memory in the GPU, so addressing less than 4 GiB of memory space is fine. As my host is also executing AVX-512 instructions with all 32 zmm registers, which is only available in x64 mode, an x86-program is not an option. That makes the whole situation a bit challenging.
Well, my fallback solution is to spawn a x86-child process that uses shared memory and acts as a compiling gate. But I'd rather not do that if a simple flag or (AMD specific) setting in OpenCL does the trick.
Please don't reply with a why-that-is-response. I'm completely aware why the x64-program and kernel behave that way.
I've a couple ideas, but not being familiar with the guts of the AMD GPU OpenCL implementation, I am stabbing in the dark.
Can you pass the data in via an image (even if it's not)? On Intel GPUs going through the sampler provides a different path and can avoid 64-bit arithmetic even in the 64-bit version.
Does AMD have an extension that allows you to block read and write? This can help if the compiler proves that the address is uniform (scalar). E.g. something like Intel Subgroups (which enable some block IO). On Intel this helps avoid shipping a SIMD's worth of addresses across the bus for a scatter/gather (and saves register space too).
(This is a stretch.) Does compiling for OpenCL 1.2 or lower help? That is, specify -cl-std=CL1.2? If the compiler knows that SVM is not being used (>=OpenCL 2.0) and were to run a conservative analysis on the program to prove that it's not doing something wild with pointer arithmetic, it could feasibly do arithmetic in 32-bit and implicitly add a 64-bit relative offset to all addresses (making the GPU program think that it's using 32-bit addresses).
Again, I know nothing about AMD specifics, but I feel your pain with this problem.

Atomic operations between multiple devices

I am developing something in heterogeneous systems with CPU and GPU (AMD APU, in fact) with OpenCL. Since I will use atomic operations to guarantee the integrity of data, and the data is shared among CPU device and GPU device, on each of which there is a kernel running on the shared data. My question is: is atomic operation still valid between these two devices? Hope anyone can help me. Many thanks.
Appendix A of the OpenCL Specification covers the synchronization of memory objects between different devices. There is no guarantee both devices will access the memory objects at the same physical location: one of the devices may work on a copy of the buffer, and only synchronization as described in Appendix A will ensure the other devices gets a copy of it.
Your implementation on the AMD APU may allow both CPU and GPU to share the same address space, and may not require the inter device synchronization. I would suggest to check AMD documentations and experiment.

Resources