Recovering line of code that caused kernel panic - r

I'm running CentOS 8.1 and my machine has kernel panic'd. I installed the kernel-debuginfo package and I am generally following the steps in Section 7.11 : Analyzing a core dump.
Here is an abbreviated version of my debugging session :
# crash /usr/lib/debug/usr/lib/modules/4.18.0-147.el8.x86_64/vmlinux /var/crash/XXX/vmcore
.
.
.
WARNING: kernel relocated [336MB]: patching 93296 gdb minimal_symbol values
KERNEL: /usr/lib/debug/usr/lib/modules/4.18.0-147.el8.x86_64/vmlinux
DUMPFILE: /var/crash/XXX/vmcore [PARTIAL DUMP]
CPUS: 48
DATE: Sun Jan 10 13:36:04 2021
UPTIME: 23 days, 22:18:40
LOAD AVERAGE: 10.00, 10.01, 10.00
TASKS: 1966
NODENAME: YYY
RELEASE: 4.18.0-147.el8.x86_64
VERSION: #1 SMP Wed Dec 4 21:51:45 UTC 2019
MACHINE: x86_64 (2794 Mhz)
MEMORY: 2035.9 GB
PANIC: "Kernel panic - not syncing: Hard LOCKUP"
PID: 27666
COMMAND: "R"
TASK: ffff8ff017978000 [THREAD_INFO: ffff8ff017978000]
CPU: 2
STATE: TASK_RUNNING (PANIC)
crash> bt
PID: 27666 TASK: ffff8ff017978000 CPU: 2 COMMAND: "R"
#0 [ffffb5187c6c7a50] machine_kexec at ffffffff96057c4e
#1 [ffffb5187c6c7aa8] __crash_kexec at ffffffff96155b8d
#2 [ffffb5187c6c7b70] panic at ffffffff960b0578
#3 [ffffb5187c6c7bf8] watchdog_overflow_callback.cold.8 at ffffffff9618bb11
#4 [ffffb5187c6c7c08] __perf_event_overflow at ffffffff961f54f2
#5 [ffffb5187c6c7c38] x86_pmu_handle_irq at ffffffff96007a16
#6 [ffffb5187c6c7e88] amd_pmu_handle_irq at ffffffff96008b14
#7 [ffffb5187c6c7ea0] perf_event_nmi_handler at ffffffff960060cd
#8 [ffffb5187c6c7eb8] nmi_handle at ffffffff96021843
#9 [ffffb5187c6c7f10] default_do_nmi at ffffffff96021cce
#10 [ffffb5187c6c7f30] do_nmi at ffffffff96021ea8
#11 [ffffb5187c6c7f50] nmi at ffffffff96a01537
RIP: 0000146816a69d6e RSP: 00007ffc378f6270 RFLAGS: 00000216
RAX: 000000006655e8df RBX: 0000000000000003 RCX: 000000000003ad90
RDX: 0000000000000003 RSI: 0000000007ccb060 RDI: 000000076fe90140
RBP: 0000000000085fc6 R8: 0000000000b13039 R9: 00000007770c0758
R10: 0000000778ba8a38 R11: 0000000000000c36 R12: 000000000070e070
R13: 0000000000000000 R14: 00007ffc378f6410 R15: 00007ffc378f6430
ORIG_RAX: ffffffffffffffff CS: 0033 SS: 002b
Clearly the culprit is an R process which I was running (see COMMAND: "R"). Looking at the bt above, it seems like it is only returning the kernel level functions. I want to know what line in my R code (or installed R libraries) causing the issue. Trying
crash> gdb bt
No stack.
gdb: gdb request failed: bt
crash>
is not useful. Looking at man crash it isn't directly obvious how to do that. Some of the R libraries possibly involved have C++ code that was compiled with debug flags. I have a vague hope of the cause of the error being in one of these.
QUESTION :
How do I use the Linux crash utility to recover the line of code (either R or C++) that caused the kernel to panic?
What is a "Kernel panic - not syncing: Hard LOCKUP"

Related

mpirun running job serially with only one core

I have installed mpich4.1 in ubuntu machine using GNU compiler. In the beginning I ran one job successfully using mpirun on '36' cores, but now when I'm trying to run same job it's running serially using only one core. Now the command output of mpirun -np 36 ./wrf.exe is
starting wrf task 0 of 1
starting wrf task 0 of 1
starting wrf task 0 of 1
starting wrf task 0 of 1
starting wrf task 0 of 1
starting wrf task 0 of 1
The mpivars gives error with
Abort(470406415): Fatal error in internal_Init_thread: Other MPI error, error stack:
internal_Init_thread(67): MPI_Init_thread(argc=0x7fff8044f34c, argv=0x7fff8044f340, required=0, provided=0x7fff8044f350) failed
MPII_Init_thread(222)...: gpu_init failed
But the machine is not having GPU.
The mpi version command gives
HYDRA build details:
Version: 4.1
Release Date: Fri Jan 27 13:54:44 CST 2023
CC: gcc
Configure options: '--disable-option-checking' '--prefix=/home/MODULES' '--cache-file=/dev/null' '--srcdir=.' 'CC=gcc' 'CFLAGS= -O2' 'LDFLAGS=' 'LIBS=' 'CPPFLAGS= -DNETMOD_INLINE=__netmod_inline_ofi__ -I/home/MODULES/mpich-4.1/src/mpl/include -I/home/MODULES/mpich-4.1/modules/json-c -D_REENTRANT -I/home/MODULES/mpich-4.1/src/mpi/romio/include -I/home/MODULES/mpich-4.1/src/pmi/include -I/home/MODULES/mpich-4.1/modules/yaksa/src/frontend/include -I/home/MODULES/mpich-4.1/modules/libfabric/include'
Process Manager: pmi
Launchers available: ssh rsh fork slurm ll lsf sge manual persist
Topology libraries available: hwloc
Resource management kernels available: user slurm ll lsf sge pbs cobalt
Demux engines available: poll select
What could be the possible reason for this?
Thanks in advance.

SDL_Init(SDL_INIT_VIDEO) generating excpetion dbus[3856] about dbus library

I developed a C++ application that among others use SDL2.
It compiles and run on Ubuntu 18.04 (64 bit machine) and on OSX.
When I try to compile the application on a computer on chip running Ubuntu Mate 18.04 in a 32 bits machine with the libraries compiled by me, it returns an exception.
The exception is happening inside the line SDL_Init( SDL_INIT_VIDEO ) and the message is the following:
dbus[3856]: arguments to dbus_message_new_method_call were incorrect, assertion "path != NULL" failed in file ../../../dbus/dbus-message.c ine 1362.
This is normally a bug in some application using D-Bus library.
D-Bus not built with -rdynamic so unable to print backtrace.
Aborted
It seems a problem happening on Ubuntu running on a 32 bit machine. Have anyone had the same problem?
How can I solve this problem? Or does anyone know what is generating it?
NOTE 1
Running in gdb and using backtrace as suggested:
#0 0xb613d206 in __libc_do_syscall () at ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:47
#1 0xb614ab32 in __libc_signal_restore_set (set=0xbeffe3a4) at ../sysdeps/unix/sysv/linux/nptl-signals.h:80
#2 0xb614ab32 in __GI_raise (sig=sig#entry=6) at ../sysdeps/unix/sysv/linux/raise.c:48
#3 0xb614b82e in __GI_abort () at abort.c:79
#4 0xb2adddf0 in _dbus_abort () at /lib/arm-linux-gnueabihf/libdbus-1.so.3
#5 0xb2ad7c5a in _dbus_warn_check_failed () at /lib/arm-linux-gnueabihf/libdbus-1.so.3
#6 0xb2ad8104 in _dbus_warn_return_if_fail () at /lib/arm-linux-gnueabihf/libdbus-1.so.3
#7 0xb2acce80 in dbus_message_new_method_call () at /lib/arm-linux-gnueabihf/libdbus-1.so.3
#8 0xb68acdba in () at /usr/lib/arm-linux-gnueabihf/libSDL2-2.0.so.0
NOTE 2
The same happens if I install SDL from the repository instead of building it.
I found a partial solution so anybody with a better one would be welcome.
As #Keltar suggested in the comment I debugged it with gdb and backtrace as already said in the question. The last line of backtrace got me suspicious:
#8 0xb68acdba in () at /usr/lib/arm-linux-gnueabihf/libSDL2-2.0.so.0
which suggested it is not taking the library built by me but the one installed with apt. I checked all the paths in Makefile and everything was pointing to my custom built. So I checked the dynamic library files *.so and I found the situation was the following:
drwxr-xr-x 3 odroid odroid 4096 Oct 15 12:21 cmake
lrwxrwxrwx 1 odroid odroid 17 Oct 15 12:21 libSDL2-2.0d.so -> libSDL2-2.0d.so.0
lrwxrwxrwx 1 odroid odroid 21 Oct 15 12:21 libSDL2-2.0d.so.0 -> libSDL2-2.0d.so.0.8.0
-rw-r--r-- 1 odroid odroid 4317020 Oct 15 12:21 libSDL2-2.0d.so.0.8.0
lrwxrwxrwx 1 odroid odroid 15 Oct 15 12:40 libSDL2-2.0.so -> libSDL2-2.0d.so
-rw-r--r-- 1 odroid odroid 7593354 Oct 15 12:21 libSDL2d.a
-rw-r--r-- 1 odroid odroid 4754 Oct 15 12:19 libSDL2maind.a
lrwxrwxrwx 1 odroid odroid 14 Oct 15 13:38 libSDL2.so -> libSDL2-2.0.so
drwxr-xr-x 2 odroid odroid 4096 Oct 15 12:21 pkgconfig
Note that for the symbolic link file libSDL2.so -> libSDL2-2.0.so, the linked file libSDL2-2.0.so was not created.
My solution was to rebuild the library in Release mode instead of Debug using cmake -DCMAKE_BUILD_TYPE=Release: in that way all the files library are correct. Another solution might be to symlink the file(s) manually but is is easy to mess up something.
So it seems a bug in how SDL cmake write the instructions for the Debug build type because in release mode my program run.
If anyone has a better solution or find any error that I could not find please let me know.

intel_pt data cannot be imported properly into Intel VTune 2018

I am using Intel VTune 2018 to profile and derive the control flow dependencies by making use of the Intel_PT PMU under the system:
Kernel: 4.15.0-13-generic, 64bit Ubuntu
CPU: Intel® Core™ i7-7820X # 3.60GHz × 16
I started with the following commands:
1- amplxe-perf record -o a.perf -T -e intel_pt// -- ps
PID TTY TIME CMD
21471 pts/1 00:00:00 amplxe-perf
21472 pts/1 00:00:00 ps
58693 pts/1 00:00:00 sudo
58694 pts/1 00:00:00 su
58695 pts/1 00:00:00 bash
[ perf record: Woken up 2 times to write data ]
[ perf record: Captured and wrote 3.154 MB a.perf ]
2- amplxe-cl -import a.perf -r folder
amplxe: Importing a new result 100 % done
amplxe: Using result path `/home/amad/May2/folder'
amplxe: Executing actions 12 % Loading 'a.perf' file
amplxe: Error: Cannot load data file `/home/amad/May2/folder/data.0/a.perf' (Data file is corrupted).
amplxe: Executing actions 50 % done
amplxe: Error: 0x4000001e (Cannot load raw collector data)
Although intel_pt data has not been successfully imported, the data for other kernel PMU events like "cpu-cycles" and "instructions" could be properly handled:
1- amplxe-perf record -o p.perf -T -e cpu-cycles,instructions -- ps
PID TTY TIME CMD
8410 pts/0 00:00:00 sudo
8458 pts/0 00:00:00 amplxe-perf
8467 pts/0 00:00:00 ps
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.024 MB p.perf (96 samples) ]
2- amplxe-cl -import p.perf -r r2
amplxe: Importing a new result 100 % done
amplxe: Using result path `/home/amad/r2'
amplxe: Executing actions 19 % Resolving information for `libprocps.so.6.0.0'
amplxe: Warning: Cannot locate debugging information for file `/lib/x86_64-linux-gnu/libprocps.so.6.0.0'.
amplxe: Executing actions 21 % Resolving information for `vmlinux'
amplxe: Warning: Cannot locate debugging information for the Linux kernel. Source-level analysis will not be possible. Function-level analysis will be limited to kernel symbol tables. See the Enabling Linux Kernel Analysis topic in the product online help for instructions.
amplxe: Executing actions 75 % Generating a report
Collection and Platform Info
----------------------------
Parameter r2
---------------- ------------------------------------
Operating System 4.15.0-13-generic
Computer Name amad-pc
Result Size 2766877
Collector Type Driverless Perf per-process sampling
CPU
---
Parameter r2
----------------- ----------
Frequency 3600000000
Logical CPU Count 16
Summary
-------
Elapsed Time: 0.011
Paused Time: 0.0
CPU Time: 0.011
Average CPU Utilization: 0.897
Event summary
-------------
Hardware Event Type Hardware Event Count:Self Hardware Event Sample Count:Self Events Per Sample
------------------- ------------------------- -------------------------------- -----------------
cpu-cycles 40521584 45 4000
instructions 36302909 51 4000
amplxe: Executing actions 100 % done
What is wrong with Intel_pt data?

Altera Quartus falsly says Modelsim isn't installed

Installed Quartus 13.0 with Modelsim in Fedora 22 64-bit. Running Quartus in 32-bit because I get lots and lots of problems otherwise. However, I can start Quartus, create a project, synthesize it, fire up the simulation window and configure the in signals. Then, when clicking the button for launching Modelsim, it starts doing it's job, but ends up with
ModelSim-Altera was not found. Please install ModelSim-Altera which is included with the Quartus II installer, or use the Quartus II Simulator instead by selecting "Simulation > Options > Quartus II Simulator"
This is simply not true. I can start Modelsim myself by running vsim. Here follows the full output. Any suggestions to resolve this will be +1 and no suggestions which would make sense will be punished by me.
Device family: Cyclone II
Running quartus eda_testbench
>> quartus_eda --gen_testbench --check_outputs=on --tool=modelsim_oem --format=verilog grindar -c grindar {--vector_source=/home/johan/Projects/Studies/vhdl/labs/lab1/and_grind.vwf} {--testbench_file=./simulation/qsim/grindar.vt}
PID = 20951
*******************************************************************
Running Quartus II 32-bit EDA Netlist Writer
Version 13.0.1 Build 232 06/12/2013 Service Pack 1 SJ Web Edition
Processing started: Sat Sep 12 20:31:33 2015
Command: quartus_eda --gen_testbench --check_outputs=on --tool=modelsim_oem --format=verilog grindar -c grindar --vector_source=/home/johan/Projects/Studies/vhdl/labs/lab1/and_grind.vwf --testbench_file=./simulation/qsim/grindar.vt
Selected device EP2C35F672C6 for design "grindar"
Generated Verilog Test Bench File ./simulation/qsim/grindar.vt for simulation
Quartus II 32-bit EDA Netlist Writer was successful. 0 errors, 0 warnings
Peak virtual memory: 318 megabytes
Processing ended: Sat Sep 12 20:31:34 2015
Elapsed time: 00:00:01
Total CPU time (on all processors): 00:00:01
Running quartus eda_func_netlist
>> quartus_eda --functional=on --simulation --tool=modelsim_oem --format=verilog grindar -c grindar
PID = 20953
*******************************************************************
Running Quartus II 32-bit EDA Netlist Writer
Version 13.0.1 Build 232 06/12/2013 Service Pack 1 SJ Web Edition
Processing started: Sat Sep 12 20:31:36 2015
Command: quartus_eda --functional=on --simulation=on --tool=modelsim_oem --format=verilog grindar -c grindar
Selected device EP2C35F672C6 for design "grindar"
Generated file grindar.vo in folder "/home/johan/Projects/Studies/vhdl/labs/lab1/simulation/modelsim/" for EDA simulation tool
Quartus II 32-bit EDA Netlist Writer was successful. 0 errors, 0 warnings
Peak virtual memory: 318 megabytes
Processing ended: Sat Sep 12 20:31:37 2015
Elapsed time: 00:00:01
Total CPU time (on all processors): 00:00:01
*******************************************************************
ModelSim-Altera was not found. Please install ModelSim-Altera which is included with the Quartus II installer, or use the Quartus II Simulator instead by selecting "Simulation > Options > Quartus II Simulator"
Please check if the path to the ModelSim binary is correctly specified under
Tools -> Options
I am in Windows, but hopefully the settings should be the same under Linux

Qt on i.MX6 with -platform eglfs -> Segmentation fault

I have cross-compiled Qt 5.1.1 for an i.MX6 powered Nitrogen6x board running Debian 7 (wheezy).
I have configured Qt with the -egl parameter and eglfs has been listed as QPA backend in the configure output.
However if I try to run a small example application with the -platform eglfs parameter I am running into this error:
stdin: is not a tty
[ 1] HAL user version 4.6.9 build 6622 Aug 15 2013 13:22:40
[ 2] HAL kernel version 4.6.9 build 1210
QML debugging is enabled. Only use this in a safe environment.
bash: line 1: 3673 Segmentation fault DISPLAY=:0.0 /opt/Test/bin/Test -platform eglfs
Remote application finished with exit code 139.
OpenGL ES2 and EGL are installed on the board and can be found in /usr/lib and /usr/include.
Sadly I couldn't find proper documentation for eglfs, so I am hoping that someone around here has made some experiences with it.
This is the backtrace output:
run Test-platform eglfs
Starting program: /opt/Test/bin/Test Test -platform eglfs
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/arm-linux-gnueabihf/libthread_db.so.1".
[ 1] HAL user version 4.6.9 build 6622 Aug 15 2013 13:31:17
[ 2] HAL kernel version 4.6.9 build 1210
QML debugging is enabled. Only use this in a safe environment.
[New Thread 0x2c6b7460 (LWP 4057)]
Program received signal SIGSEGV, Segmentation fault.
0x2bab6f48 in gcoHAL_QueryChipCount () from /usr/lib/libGAL.so
(gdb) backrace full
Undefined command: "backrace". Try "help".
(gdb) backrace full[1#t
#0 0x2bab6f48 in gcoHAL_QueryChipCount () from /usr/lib/libGAL.so
No symbol table info available.
#1 0x2ba7ccbc in veglGetThreadData () from /usr/lib/libEGL.so.1
No symbol table info available.
#2 0x2ba74cd0 in eglBindAPI () from /usr/lib/libEGL.so.1
No symbol table info available.
#3 0x2be41934 in ?? () from /usr/local/Qt-Debian/plugins/platforms/libqeglfs.so
No symbol table info available.
#4 0x2be41934 in ?? () from /usr/local/Qt-Debian/plugins/platforms/libqeglfs.so
No symbol table info available.
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb) info registers
r0 0x1 1
r1 0x23e54 147028
r2 0x738 1848
r3 0x0 0
r4 0x2bb67d84 733379972
r5 0x23e18 146968
r6 0x2e70c 190220
r7 0x2b430198 725811608
r8 0x7efff9e8 2130704872
r9 0x8 8
r10 0x2b0725c4 721888708
r11 0x7efffae0 2130705120
r12 0x2bab6f1c 732655388
sp 0x7efff8f0 0x7efff8f0
lr 0x2ba7ccbc 732417212
pc 0x2bab6f48 0x2bab6f48 <gcoHAL_QueryChipCount+44>
cpsr 0x80000010 -2147483632
(gdb) x/16i $pc
=> 0x2bab6f48 <gcoHAL_QueryChipCount+44>: ldr r3, [r3, #12]
0x2bab6f4c <gcoHAL_QueryChipCount+48>: sub r2, r3, #1
0x2bab6f50 <gcoHAL_QueryChipCount+52>: cmp r2, #2
0x2bab6f54 <gcoHAL_QueryChipCount+56>: bhi 0x2bab6f70 <gcoHAL_QueryChipCount+84>
0x2bab6f58 <gcoHAL_QueryChipCount+60>: ldr r2, [r4]
0x2bab6f5c <gcoHAL_QueryChipCount+64>: mov r0, #0
0x2bab6f60 <gcoHAL_QueryChipCount+68>: str r3, [r1]
0x2bab6f64 <gcoHAL_QueryChipCount+72>: add r3, r2, #1
0x2bab6f68 <gcoHAL_QueryChipCount+76>: str r3, [r4]
0x2bab6f6c <gcoHAL_QueryChipCount+80>: pop {r4, pc}
0x2bab6f70 <gcoHAL_QueryChipCount+84>: mvn r0, #8
0x2bab6f74 <gcoHAL_QueryChipCount+88>: bl 0x2baad5fc
0x2bab6f78 <gcoHAL_QueryChipCount+92>: ldr r3, [r4]
0x2bab6f7c <gcoHAL_QueryChipCount+96>: mvn r0, #8
0x2bab6f80 <gcoHAL_QueryChipCount+100>: add r3, r3, #1
0x2bab6f84 <gcoHAL_QueryChipCount+104>: str r3, [r4]
(gdb) thread apply all backtrace
Thread 2 (Thread 0x2c6b7460 (LWP 4057)):
#0 0x2b52ef96 in ?? () from /lib/arm-linux-gnueabihf/libc.so.6
#1 0x2b568634 in _IO_file_close () from /lib/arm-linux-gnueabihf/libc.so.6
#2 0x2b568ffe in _IO_file_close_it () from /lib/arm-linux-gnueabihf/libc.so.6
#3 0x2b56113a in fclose () from /lib/arm-linux-gnueabihf/libc.so.6
#4 0x2bea8d00 in udev_new () from /lib/arm-linux-gnueabihf/libudev.so.0
#5 0x2be7d2e4 in ?? () from /usr/local/Qt-Debian/plugins/platforms/libqeglfs.so
#6 0x2be7d2e4 in ?? () from /usr/local/Qt-Debian/plugins/platforms/libqeglfs.so
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
Thread 1 (Thread 0x2bcb9220 (LWP 4056)):
#0 0x2bab6f48 in gcoHAL_QueryChipCount () from /usr/lib/libGAL.so
#1 0x2ba7ccbc in veglGetThreadData () from /usr/lib/libEGL.so.1
#2 0x2ba74cd0 in eglBindAPI () from /usr/lib/libEGL.so.1
#3 0x2be41934 in ?? () from /usr/local/Qt-Debian/plugins/platforms/libqeglfs.so
#4 0x2be41934 in ?? () from /usr/local/Qt-Debian/plugins/platforms/libqeglfs.so
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb) quit
How could I possibly fix that error?
I have the exact same crash on a MarSBoard running an egl fb application on a Yocto image created with recipes from https://github.com/silmerusse/meta-robomind.
I had to copy the EGL/OpenGL related stuff from http://repository.timesys.com/buildsources/g/gpu-viv-bin-mx6q/.
In my case galcore.ko is builtin.
Edit:
Check that you have /dev/galcore and that its permission are crw.rw.rw. (otherwise sudo chmod 666 /dev/galcore).
If you don't have /dev/galcore, try insmod /lib/modules/..../kernel/drivers/mxc/gpu-viv/galcore.ko.
These steps fixed the crash for me on an ubuntu image.
On the Yocto image the galcore driver is builtin, and seems to be there but I still get the crash.
Edit:
The crash in the Yocto image was caused by the wrong version of the EGL/GAL.so libs. Apparently the galcore driver built into the kernel has version 4.6.9.6622. It requires libs from gpu-viv-bin-mx6q-3.0.35-4.1.0. Using those libs and manually copying them into /usr/lib my fb application runs fine, using hardware OpenGLES 2.0 and hardware decoding of a h264 video.
I fixed this issue by switching to Yocto and therefor gaining access to most recent releases of essential components.
If you're developing for an i.MX cpu I strongly recommend to have a look at https://github.com/Freescale/fsl-community-bsp-platform
It is very important to remove "x11" from default-distrovars and "wayland" from poky.conf as these will cause you to run into errors.
Building Qt5 on such a setup works fine.
I got similar segfault when I forgot to load the galcore module. Here is the backtrace:
#0 0x766062b0 in gcoHAL_QueryChipCount (Hal=Hal#entry=0x0, Count=Count#entry=0x16494)
at gc_hal_user_query.c:1726
#1 0x766da244 in veglGetThreadData () at gc_egl.c:137
#2 0x766d3210 in eglfGetDisplay (display_id=0x16c08) at gc_egl_init.c:464
#3 eglGetDisplay (DisplayID=0x16c08) at gc_egl_init.c:565
Qt 5.3.2, kernel 3.10.17, Galcore version 4.6.9.9754

Resources