ECC error injection on Intel Xeon C5500 platform and issue with unlocking Integrated memory controller registers

ECC error injection on Intel Xeon C5500 platform and issue with unlocking Integrated memory controller registers - intel

I am working on Error Detection module and was attempting to test using the error injection implementation mentioned in Intel® Xeon® Processor C5500/C3500 Series Datasheet, Volume 2 in section 4.12.40. It asks to program the MC_CHANNEL_X_ADDR_MATCH, MC_CHANNEL_X_ECC_ERROR_MASK and MC_CHANNEL_X_ECC_ERROR_MASK registers but attempting to write to this has no effect. Realized there is a lock for this space which is indicated by status in MEMLOCK_STATUS register (device 0: function 0: offset 88h), which in my case is reporting 0x40401 as the set value. This means MEM_CFG_LOCKED is set and I am not able to even unlock using the MC_CFG_CONTROL register (device 0:function 0: offset 90h). I am writing 0x2 to this register but that does not help to unlock the ECC injection registers for writing. How can I achieve this? I am running FreeBSD on the bare metal and not as a virtual machine.

To the best of my knowledge, the whole TXT thing that is necessary for this is not supported on FreeBSD.
But this a quite an arcane area. You might have more luck asking this on the freebsd-hackers mailing list.

Related

How does the host send OpenCL kernels and arguments to the GPU at the assembly level?

So you get a kernel and compile it. You set the cl_buffers for the arguments and then clSetKernelArg the two together.
You then enqueue the kernel to run and read back the buffer.
Now, how does the host program tell the GPU the instructions to run. e.g. I'm on a 2017 MBP with a Radeon Pro 460. At the assembly level what instructions are called in the host process to tell the GPU "here's what you're going to run." What mechanism lets the cl_buffers be read by the GPU?
In fact, if you can point me to an in detail explanation of all of this I'd be quite pleased. I'm a toolchain engineer and I'm curious about the toolchain aspects of GPU programming but I'm finding it incredibly hard to find good resources on it.

It pretty much all runs through the GPU driver. The kernel/shader compiler, etc. tend to live in a user space component, but when it comes down to issuing DMAs, memory-mapping, and responding to interrupts (GPU events), that part is at least to some extent covered by the kernel-based component of the GPU driver.
A very simple explanation is that the kernel compiler generates a GPU-model-specific code binary, this gets uploaded to VRAM via DMA, and then a request is added to the GPU's command queue to run a kernel with reference to the VRAM address where that kernel is stored.
With regard to OpenCL memory buffers, there are essentially 3 ways I can think of that this can be implemented:
A buffer is stored in VRAM, and when the CPU needs access to it, that range of VRAM is mapped onto a PCI BAR, which can then be memory-mapped by the CPU for direct access.
The buffer is stored entirely in System RAM, and when the GPU accesses it, it uses DMA to perform read and write operations.
Copies of the buffer are stored both in VRAM and system RAM; the GPU uses the VRAM copy and the CPU uses the system RAM copy. Whenever one processor needs to access the buffer after the other has made modifications to it, DMA is used to copy the newer copy across.
On GPUs with UMA (Intel IGP, AMD APUs, most mobile platforms, etc.) VRAM and system RAM are the same thing, so they can essentially use the best bits of methods 1 & 2.
If you want to take a deep dive on this, I'd say look into the open source GPU drivers on Linux.

The enqueue the kernel means ask an OpenCL driver to submit work to dedicated HW for execution. In OpenCL, for example, you would call the clEnqueueNativeKernel API, which will add the dispatch compute workload command to the command queue - cl_command_queue.
From the spec:
The command-queue can be used to queue a set of operations (referred to as commands) in order.
https://www.khronos.org/registry/OpenCL/specs/2.2/html/OpenCL_API.html#_command_queues
Next, the implementation of this API will trigger HW to process commands recorded into a command queue (which holds all actual commands in the format which particular HW understands). HW might have several queues and process them in parallel. Anyway after the workload from a queue is processed, HW will inform the KMD driver via an interrupt, and KMD is responsible to propagate this update to OpenCL driver via OpenCL supported event mechanism, which allows user to track workload execution status - see https://www.khronos.org/registry/OpenCL/specs/2.2/html/OpenCL_API.html#clWaitForEvents.
To get better idea how the OpenCL driver interacts with a HW you could take a look into the opensource implementation, see:
https://github.com/pocl/pocl/blob/master/lib/CL/clEnqueueNativeKernel.c

How to programmatically determine which is the boot disk on Solaris/illumos?

On a test server there are two Samsung 960 Pro SSDs, exactly same maker, model and size. On both I've installed a fresh install of exactly the same OS, OmniOS r15026.
By pressing F8 at POST time, I can access the motherboard BOOT manager, and choose one of the two boot drives. Thus, I know which one the system booted from.
But how can one know programmatically, after boot, which is the boot disk?
It seems that is:
Not possible on Linux,
Not possible on FreeBsd
Possible on macOS.
Does Solaris/illumos offer some introspective hooks to determine which is the boot disk?
Is it possible to programmatically determine which is the boot disk on Solaris/illumos?
A command line tool would be fine too.
Edit 1: Thanks to #andrew-henle, I have come to know command eeprom.
As expected it is available on illumos, but on test server with OmniOS unfortunately it doesn't return much:
root#omnios:~# eeprom
keyboard-layout=US-English
ata-dma-enabled=1
atapi-cd-dma-enabled=1
ttyd-rts-dtr-off=false
ttyd-ignore-cd=true
ttyc-rts-dtr-off=false
ttyc-ignore-cd=true
ttyb-rts-dtr-off=false
ttyb-ignore-cd=true
ttya-rts-dtr-off=false
ttya-ignore-cd=true
ttyd-mode=9600,8,n,1,-
ttyc-mode=9600,8,n,1,-
ttyb-mode=9600,8,n,1,-
ttya-mode=9600,8,n,1,-
lba-access-ok=1
root#omnios:~# eeprom boot-device
boot-device: data not available.
Solution on OmniOS r15026
Thanks to #abarczyk I was able to determine the correct boot disk.
I had to use a slightly different syntax:
root#omnios:~# /usr/sbin/prtconf -v | ggrep -1 bootpath
value='unix'
name='bootpath' type=string items=1
value='/pci#38,0/pci1022,1453#1,1/pci144d,a801#0/blkdev#w0025385971B16535,0:b
With /usr/sbin/format, I was able to see entry corresponds to
16. c1t0025385971B16535d0 <Samsung-SSD 960 PRO 512GB-2B6QCXP7-476.94GB>
/pci#38,0/pci1022,1453#1,1/pci144d,a801#0/blkdev#w0025385971B16535,0
which is correct, as that is the disk I manually selected in BIOS.
Thank you very much to #abarczyk and #andrew-henle to consider this and offer instructive help.

The best way to find the device from which the systems is booted is to check prtconf -vp output:
# /usr/sbin/prtconf -vp | grep bootpath
bootpath: '/pci#0,600000/pci#0/scsi#1/disk#0,0:a'

On my Solaris 11.4 Beta system, there is a very useful command called devprop which helps answer your question:
$ devprop -s bootpath
/pci#0,0/pci1849,8c02#1f,2/disk#1,0:b
then you just have to look through the output of format to see what that translates to. On my system, that is
9. c2t1d0 <ATA-ST1000DM003-1CH1-CC47-931.51GB>
/pci#0,0/pci1849,8c02#1f,2/disk#1,0

Use the eeprom command.
Per the eeprom man page:
Description
eeprom displays or changes the values of parameters in the EEPROM.
It processes parameters in the order given. When processing a
parameter accompanied by a value, eeprom makes the indicated
alteration to the EEPROM; otherwise, it displays the parameter's
value. When given no parameter specifiers, eeprom displays the values
of all EEPROM parameters. A '−' (hyphen) flag specifies that
parameters and values are to be read from the standard input (one
parameter or parameter=value per line).
Only the super-user may alter the EEPROM contents.
eeprom verifies the EEPROM checksums and complains if they are
incorrect.
platform-name is the name of the platform implementation and can be
found using the –i option of uname(1).
SPARC
SPARC based systems implement firmware password protection with
eeprom, using the security-mode, security-password and
security-#badlogins properties.
x86
EEPROM storage is simulated using a file residing in the
platform-specific boot area. The /boot/solaris/bootenv.rc file
simulates EEPROM storage.
Because x86 based systems typically implement password protection in
the system BIOS, there is no support for password protection in the
eeprom program. While it is possible to set the security-mode,
security-password and security-#badlogins properties on x86 based
systems, these properties have no special meaning or behavior on x86
based systems.

MPI - Add/remove node while program is running

Is there an MPI implementation that allows nodes to be dynamically added/removed at runtime? Do any recover from complete hardware failure of a node, allowing the node to be repaired and relaunched without restarting the program?

Is there an MPI implementation that allows nodes to be dynamically added/removed at runtime?
This is actually two questions. Nodes usually can be dynamically added at runtime using calls like MPI_Comm_spawn. As #Hristo pointed out in the comments, you should set the correct info key in Open MPI. It may also be possible in other implementations. As for removing nodes, that's a big area of research at the moment. Most MPI implementations currently have varying levels of success surviving a total node failure. In the current releases of Open MPI, I don't believe there is any support for that sort of failure [citation needed], though there is work to improve that ongoing. In the current version of MPICH, you can pass the flag -disable-auto-cleanup to mpiexec and it will not automatically clean up your application after a process/node failure. However, you'll still have to modify your MPI application to handle this situation. The various derivatives of MPICH (Intel MPI, Cray MPI, IBM MPI, MVAPICH, etc.) all don't support this feature AFAIK. There are other research implementations that are also available to extend the support of the MPI Standard. User Level Failure Mitigation is currently being considered by the standardization body as a way of letting the user handle process failures. There is a research implementation based on Open MPI available at the website linked, and an experimental prototype will also be in the next version of MPICH (3.2).
Do any recover from complete hardware failure of a node, allowing the node to be repaired and relaunched without restarting the program?
This is essentially the same process as above. You would need to use the APIs to remove a process and then somehow find out that it's available and add it back using spawn. These calls have to be made from inside the application though, not externally.

UEFI runtime service next to OS

I had the idea of running a small service next to the OS but I'm not sure if it is possible. I tried to figure it out by reading some docs but didn't get far, so here comes my question.
I read about the UEFI runtime services.
Would it be possible to have a small module in the firmware which runs next to what ever operating system is used and that sends information concerning the location of the device to an address on the internet?
As far as my knowledge goes, I would say that it should not possbile to run something in the background once UEFI handed the control over to the OS kernel.
To clarify my intentions, I would like to have something like that on my laptop. There is the Prey project but it is installed inside the OS. I'm using a Linux distribution without autologin. If somebody would steal it they would probably just install Windows.

What you want to do is prohibited because that would be the gateway for viruses, loggers and other malwares.
That said, if you want to get some code running aside of the OS, you should look at System Management Mode (SMM).
SMM is an execution mode of the x86 processors orthogonal to the standard protected mode. SMM allows the BIOS to completely suspend the OS on all the CPUs at once and enter in SMM mode to execute some BIOS services. Switching to SMM mode happens right now on your x86 machine, as you're reading this Stackoverflow answer. It is triggered either by:
hardware: a dedicated System Management Interrupt line (SMI#), very similar to how IRQs work,
software: via an I/O access to a location considered special by the motherboard logic (port 0xb2 is common).
SMM services are called SMM handlers and for instance sensors values are very often retrieved by the means of a SMM call to a SMM handler.
SMM handlers are setup during the DXE phase of UEFI firmware initialization into the SMRAM, an area dedicated to SMM handlers. See the following diagram:
SMM Drivers are dispatched by the SMM Core during the DXE Phase. So
additional SMI handlers can be registered in the DXE Phase. Late in
the DXE Phase, when no more SMM drivers can be dispatched, SMRAM will
be locked down (as recommended practice). Once SMRAM is locked down,
no additional SMM Drivers may be dispatched, so not additional SMI
handlers can be registered. For example, an SMM Driver that registers
an SMI handler cannot be loaded from the EFI Shell or be added as a
DriverOption in the UEFI Boot Manager.
source: tianocore
This means that the code of your SMM handler must be present in the BIOS image, which implies rebuilding the BIOS with your handler added. It's tricky but tools exist out there to both provide a DXE environment and build your SMM handler code into a PE executable, as well as other tools to add a DXE driver to an existing BIOS image. Not all BIOS manufacturers are supported though. It's risky unless your Flash chip is in a socket and you can reprogram it externally.
But the first thing is to check if the SMRAM is locked on your system. If you're lucky you can add your very own SMM handler directly in SMRAM. It's fidly but doable.
Note: SMM handlers inside the BIOS are independent from the OS so it would run even if a robber installs a new Operating System, which is what you want. However being outside of an OS has huge disadvantages: you'd need to embedd in your SMM handler a driver for the network interface (a polling-only, interrupt-less driver!) and wlan 802.11, DHCP and IP support to connect to the Wifi and get your data routed to an external host on the Internet. How would you determine the wifi SSID and password? Well you could wait for the OS to initialize the network adapter for you, but you'd need to save/restore the full state of the network host controller between calls. Not a small or easy project.

As far as my knowledge goes, I would say that it should not possbile to run something in the background once UEFI handed the control over to the OS kernel.
I agree. Certainly, the boot environment (prior to ExitBootServices() only uses a single threaded model.
There is no concept of threads in the UEFI spec as far as I can see. Furthermore each runtime service is something the OS deliberately invokes, much like the OS provides system calls for applications. Once you enter a runtime system function, note the following restriction from 7.1:
Callers are prohibited from using certain other services from another processor or on the same
processor following an interrupt as specified in Table 30.
Depending on which parts of the UEFI firmware your runtime service needed access to would depend on which firmware functions would be non-reentrant whilst your call was busy.
Which is to say that, even if you were prepared to sacrifice a thread to sit eternally inside an EFI runtime service, you could well block the entire rest of the kernel from using other runtime services.
I do not think it is going to be possible unfortunately, but interesting question all the same!

Serial Comms dies in WinXP

A bit of history: We have an application, which was originally written many years ago (1998 is the first date in PVCS but the app is about 5 years older than that as it originally was a DOS program). This application communicates with a piece of hardware via serial. When we got to Windows XP we started receiving reports of the app dying after a short time of running. It seems that the serial comms just 'died' and the app was left in a stuck state. The only way to recover from this situation was to restart the application.
The only information I can find regarding this problem was apparently the Windows Message system would miss that information was received, the buffer would fill and the system would get stuck. This snippet of information was left in a old word document, but there's no evidence to back this up. It also mentions that this is only prevalent at high baud rates (115200+).
The solution was to provide customers with USB->Serial converters along with the hardware.
Today: We are working on a new version of the hardware that will run across a network as well as serial ports. So to allow me to work on the network code, minus the actual hardware we are using a VSCOM NetCom113 device. It also installs a virtual comm port on the users (ie: mine) machine.
Now I have got the network code integrated with the app, it appears that the NetCom device exhibits the same behaviour as a physical commport. This is undesirable as I need the app to run longer than ~30 seconds.
Google turns up zero problems that we experience.
I was wondering:
Has anyone experienced this before? If so what did you do to fix/workaround the problem?
Does anyone have any suggestions as to whether the original author of the document is correct and what I can do to test the theory?
Unfortunately I can't post code as the serial code is tightly couple with the rest of the system, though if you have questions regarding it I can answer questions about it.
Updates:
The code is written using Win32 Comm routines - so I am using CreateFile, ReadFile. There's also judicious calls to GetOverlappedResult.
It's not hanging per se, it's just that the comms stops. You can access the menus, click the buttons, but nothing can interact with the connected hardware. Using realterm you can see that no data is coming in or going out.
I think the reference to the windows message is that the problem is internal to windows. Data has arrived but the kernal has missed it and thus not told the rest of the system about it.
Flow control is not used.
Writing a 'simple' test is difficult due the the fact that the code is tightly coupled and the underlying protocol is quite complex and would require a lot of work.

Are you using DOS-style serial code, or the Win32 CreateFile approach?
If the former, be very suspicious: if at all possible I'd convert to the latter.
If the latter, do you know on what kind of system call it's hanging? Are you in a blocking read call? or an overlapped I/O call? or waiting on an event? (I'm not sure I have enough experience to help, but those are the kinds of questions that come to mind)
You might also check into the queue size, which you can set with the SetupComm function.
I don't buy the "Windows Message system" stuff -- it sounds fishy; you can write good Win32 serial i/o code that never uses Windows messages.
edit: does your Overlapped I/O use events? I seem to remember something about auto-reset events occasionally missing their trigger... check your overlapped I/O calls very carefully to see whether you're handling the possible outcomes properly. Perhaps there's a way to make your code more robust by automatically cancelling the overlapped i/o and restarting another read. (I assume the problem is in the read half, not the write half?)
edit 2: A suggestion: assuming the win32 side has missed a byte or packet, and your devices are in deadlock because they're both expecting each other to respond to something, can you tweak the other side of the serial I/O to regularly send some type of "ping" packet with an incrementing counter? (and log the ping packets on the PC side; that way you can see whether you've missed any)

Are you sure you have your flow control set up correctly? DTR, RTS, etc...
-Adam

i have written apps that use usb / bluetooth serial ports and have never had an issue. with bluetooth i have seen bit rates (sustained) of 800,000 bps for long periods of time. most people don't properly implement the port.
My serial port

Not sure if this is a possibility for you, but if you could re-write the code using C#.NET you'd have access to the SerialPort class there. It might remedy your problem. I know a lot of legacy code based around the Win32 API for hardware I/O ports tended to fail in XP due to timing (had a small bit of experience with MIDI).
In addition, I don't know if you can use the Win32 method of Serial Port access in Vista, so that might shut out future MS OSes from being able to use your code.

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex