Debugging a boot crash on Linux 7.0-rc1 and 7.0-rc2 (Strix Halo)

The starting point

I enjoy tinkering with things, and yes, that includes debugging Linux kernels in my free time. My daily driver is an HP ZBook Ultra 14 G1a with an AMD Ryzen AI Max+ PRO 395 (Strix Halo), running CachyOS. On kernel 6.19 I had been dealing with flaky s2idle: sometimes it would suspend fine, sometimes it wouldn't. Then HP pushed BIOS update 01.04.05 Rev.A, which shipped newer NPU firmware with its protocol major version bumped to 7.

Here's the problem: the in-tree amdxdna driver on 6.19 hardcodes protocol_major = 0x6 in npu5_regs.c and flat-out rejects anything else. If the firmware reports major 7, the driver bails with aie2_check_protocol: Incompatible firmware protocol major 7. The fix landed in 7.0-rc1 via commit f1eac46fe5f7 ("accel/amdxdna: Update firmware version check for latest firmware"), which reworks the version check to accept newer firmware. Combined with several sleep/resume improvements also queued for 7.0, I had plenty of reasons to try the new kernel.

So I installed 7.0-rc2 from the CachyOS Kernel Manager and rebooted.

Black screen. Nothing.

I rebooted again, edited the kernel command line in Limine, stripped out splash and quiet, and added debug. Now I could see where it choked: in amdgpu, then silence.

amdgpu 0000:c3:00.0: initializing kernel modesetting (IP DISCOVERY 0x1002:0x1586 0x103C:0x8D01 0xD1)
amdgpu 0000:c3:00.0: register mmio base: 0xA0300000
amdgpu 0000:c3:00.0: register mmio size: 1048576
amdgpu 0000:c3:00.0: detected ip block number 0 <common_v1_0_0> (soc21_common)
amdgpu 0000:c3:00.0: detected ip block number 1 <gmc_v11_0_0> (gmc_v11_0)
amdgpu 0000:c3:00.0: detected ip block number 2 <ih_v6_0_0> (ih_v6_1)
amdgpu 0000:c3:00.0: detected ip block number 3 <psp_v13_0_0> (psp)
amdgpu 0000:c3:00.0: detected ip block number 4 <smu_v14_0_0> (smu)
amdgpu 0000:c3:00.0: detected ip block number 5 <dce_v1_0_0> (dm)
amdgpu 0000:c3:00.0: detected ip block number 6 <gfx_v11_0_0> (gfx_v11_0)
amdgpu 0000:c3:00.0: detected ip block number 7 <sdma_v6_0_0> (sdma_v6_0)
amdgpu 0000:c3:00.0: detected ip block number 8 <vcn_v4_0_5> (vcn_v4_0_5)
amdgpu 0000:c3:00.0: detected ip block number 9 <jpeg_v4_0_5> (jpeg_v4_0_5)
amdgpu 0000:c3:00.0: detected ip block number 10 <mes_v11_0_0> (mes_v11_0)
amdgpu 0000:c3:00.0: detected ip block number 11 <vpe_v6_1_0> (vpe_v6_1)
amdgpu 0000:c3:00.0: detected ip block number 12 <isp_v4_1_1> (isp_ip)
amdgpu 0000:c3:00.0: Fetched VBIOS from VFCT
amdgpu 0000:c3:00.0: [drm] ATOM BIOS: 113-STRXLGEN-001
amdgpu 0000:c3:00.0: VPE: collaborate mode true

Searched around online, found nothing useful, and called it a night.

But the thought kept nagging me. I really wanted that NPU running with a local LLM.

Next day, I headed to the drm/amd GitLab and filed an issue. Searched through the existing ones first, found nothing similar. Naturally, the moment I submitted mine, I spotted an existing one describing the exact same problem on the exact same laptop. Classic. So I commented on that one instead.

That issue had something mine didn't: the full kernel oops stacktrace. I figured this was interesting enough to dig into properly, so I cloned the CachyOS kernel source and started getting my hands dirty.

What followed was a rabbit hole of three separate bugs, all triggered by one seemingly innocent commit. Here's how each one unfolded.


Problem 1: Boot crash (NULL pointer dereference)

The kernel oops looked like this:

BUG: kernel NULL pointer dereference, address: 0000000000000000
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
Oops: Oops: 0000 [#1] SMP NOPTI
CPU: 6 UID: 0 PID: 1607 Comm: (udev-worker) Tainted: G  OE  7.0.0-rc2-1-MANJARO
Hardware name: HP HP ZBook Ultra G1a 14 inch Mobile Workstation PC/8D01,
               BIOS X89 Ver. 01.04.05 01/19/2026
RIP: 0010:isp_genpd_add_device+0x25/0xc0 [amdgpu]
Call Trace:
 device_for_each_child+0x71/0xb0
 isp_v4_1_1_hw_init+0x42b/0x4a0 [amdgpu]
 amdgpu_device_init.cold+0x19de/0x250f [amdgpu]
 amdgpu_driver_load_kms+0x19/0x80 [amdgpu]
 amdgpu_pci_probe+0x233/0x480 [amdgpu]
 local_pci_probe+0x3e/0x90
 pci_device_probe+0xe1/0x260
 really_probe+0xde/0x380

See https://gitlab.freedesktop.org/drm/amd/-/issues/5021

The ISP (image signal processor) driver uses device_for_each_child() to iterate over child devices and calls strcmp(dev->type->name, "mfd_device") on each one, filtering for MFD children. The reporter of that issue, @iglooom, bisected the culprit. Commit 057edc58aa5 moved ACPI wakeup source registration from ACPI devices to physical devices, which added new child devices under the ISP's parent. These new children don't have dev->type set at all, so dereferencing dev->type->name hits a NULL pointer and crashes the kernel before you ever see a login screen.

Fix: a two-line NULL check.

if (!dev->type || !dev->type->name)
    return 0;
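
To make the failure mode concrete, here is a minimal user-space sketch of the same pattern. It is not the amdgpu code: the structs are stripped-down stand-ins for the kernel's struct device and struct device_type, and for_each_child(), count_mfd_child() and the label field are hypothetical names invented for this illustration. The only part taken from the actual fix is the guard itself.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Simplified stand-ins for the kernel structures; only the fields needed
 * for this illustration are modeled. */
struct device_type {
    const char *name;
};

struct device {
    const char *label;              /* hypothetical: for readability only */
    const struct device_type *type; /* may be NULL, like the new wakeup children */
};

/* Callback in the style of device_for_each_child(): returning 0 means
 * "keep iterating". Counts children whose type name is "mfd_device". */
int count_mfd_child(struct device *dev, void *data)
{
    int *count = data;

    /* The guard from the fix: skip children without a type or type name
     * instead of dereferencing a NULL pointer in strcmp(). */
    if (!dev->type || !dev->type->name)
        return 0;

    if (strcmp(dev->type->name, "mfd_device") == 0)
        (*count)++;

    return 0;
}

/* Minimal stand-in for device_for_each_child(): walks an array and stops
 * at the first non-zero return value. */
int for_each_child(struct device *children, size_t n,
                   int (*fn)(struct device *, void *), void *data)
{
    assert(children != NULL || n == 0);

    for (size_t i = 0; i < n; i++) {
        int ret = fn(&children[i], data);

        if (ret)
            return ret;
    }
    return 0;
}
```

Passing an array that mixes typed MFD children with a child whose type is NULL walks the whole list and counts only the MFD ones. Without the guard, the first untyped child would end up in strcmp() through a NULL pointer, which is the user-space equivalent of the oops above.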

I patched this into the CachyOS kernel, built a new package, installed it, and rebooted. The crash was gone, and I got to the desktop.

But my webcam was dead.

Problem 2: Webcam module not autoloading

Running lsmod | grep isp showed nothing. The amd_isp4_capture module simply wasn't being loaded. On a hunch I ran modprobe amd_isp4_capture manually, and to my surprise the webcam sprang to life. So the module itself was fine; it just wasn't being picked up automatically.

I checked the device's modalias:

$ cat /sys/bus/platform/devices/amd_isp_capture/modalias
acpi:LNXVIDEO:

That's wrong. It should be platform:amd_isp_capture. The kernel's module autoloading uses this string to match devices to drivers, and with acpi:LNXVIDEO: as the device's modalias it was looking for an ACPI video driver instead of amd_isp4_capture.
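
The matching step can be sketched as a glob lookup: each module advertises alias patterns, and the device's modalias string is matched against them. The sketch below is a rough user-space model, not how kmod is actually implemented. The two-entry alias table and the resolve_modalias() helper are assumptions made up for this illustration; the real aliases live in the modules themselves and in modules.alias.

```c
#include <assert.h>
#include <fnmatch.h>
#include <stddef.h>
#include <string.h>

/* One alias pattern and the module it would load. */
struct module_alias {
    const char *pattern;
    const char *module;
};

/* Illustrative alias table (an assumption for this sketch, not read from
 * a real system). */
static const struct module_alias alias_table[] = {
    { "platform:amd_isp_capture", "amd_isp4_capture" },
    { "acpi*:LNXVIDEO:*",         "video" },
};

/* Returns the name of the first module whose alias pattern matches the
 * given modalias, or NULL if nothing matches. */
const char *resolve_modalias(const char *modalias)
{
    assert(modalias != NULL);

    for (size_t i = 0; i < sizeof(alias_table) / sizeof(alias_table[0]); i++) {
        /* fnmatch() returns 0 when the glob pattern matches the string. */
        if (fnmatch(alias_table[i].pattern, modalias, 0) == 0)
            return alias_table[i].module;
    }
    return NULL;
}
```

In this model, the broken modalias acpi:LNXVIDEO: resolves to the video module, while platform:amd_isp_capture resolves to amd_isp4_capture. That mirrors what I saw: the capture module never got requested because the device was advertising the wrong identity.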

Root cause

The ISP driver creates three sub-devices through the MFD (Multi-Function Device) framework: amd_isp_capture, amd_isp_i2c_designware, and amdisp-pinctrl. MFD has a helper called mfd_acpi_add_device() that tries to match each child to a specific ACPI node. If no match info is provided, it falls back to assigning the parent's ACPI companion via device_set_node(&pdev->dev, acpi_fwnode_handle(adev ?: parent)). That's fine when the parent is an actual ACPI-described device that the children belong to.

But here, the ISP's parent is the GPU, a PCI device. So all three camera sub-devices inherited the GPU's ACPI identity (GFX0/LNXVIDEO). This caused a cascade of issues: the ACPI video driver bound to the ISP children and created phantom "Video Bus" input devices, and the wrong modalias broke autoloading.

First attempt: modify MFD core

The obvious approach: don't assign ACPI companions in mfd_acpi_add_device() when the child has no acpi_match table. Built that, rebooted, got stuck at boot with pcie_mp2_amd complaining about "unknown laptop placement".

Turns out the AMD MP2 sensor hub also uses MFD without acpi_match, but it depends on inheriting the parent's ACPI companion. Touching MFD core broke it.

Second attempt: fix it in the ISP driver

Instead of modifying the shared MFD code, I temporarily hid the parent's firmware node right around the mfd_add_hotplug_devices() call:

fwnode = isp->parent->fwnode;
isp->parent->fwnode = NULL;

r = mfd_add_hotplug_devices(isp->parent, isp->isp_cell, 3);

isp->parent->fwnode = fwnode;

With no firmware node visible on the parent, mfd_acpi_add_device() returns early and the children keep their correct platform:amd_isp_capture modalias.
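
The inheritance and the workaround can be modeled in a few lines of user-space C. This is a simplified sketch, not the MFD core: struct fwnode and struct dev are reduced stand-ins, and model_mfd_acpi_add_device() and model_add_children() are hypothetical names that only mimic the behavior described above (inherit the parent's node, return early if the parent has none).

```c
#include <assert.h>
#include <stddef.h>

/* Reduced stand-ins for the kernel's firmware node and device. */
struct fwnode {
    const char *name;
};

struct dev {
    struct dev *parent;
    struct fwnode *fwnode;
};

/* Models the fallback path for a child without match info: inherit the
 * parent's firmware node, or do nothing if the parent has none. */
void model_mfd_acpi_add_device(struct dev *child)
{
    assert(child != NULL && child->parent != NULL);

    if (!child->parent->fwnode)
        return;

    child->fwnode = child->parent->fwnode;
}

/* Registers n children under parent. When hide_parent_node is non-zero,
 * the parent's node is hidden for the duration of the registration and
 * restored afterwards, like the workaround in the ISP driver. */
void model_add_children(struct dev *parent, struct dev *children, size_t n,
                        int hide_parent_node)
{
    struct fwnode *saved = parent->fwnode;

    if (hide_parent_node)
        parent->fwnode = NULL;

    for (size_t i = 0; i < n; i++) {
        children[i].parent = parent;
        children[i].fwnode = NULL;
        model_mfd_acpi_add_device(&children[i]);
    }

    /* Always restore, so the parent keeps its own identity. */
    parent->fwnode = saved;
}
```

With hide_parent_node set to 0, every child ends up pointing at the parent's node, which is the GFX0 inheritance problem. With it set to 1, the children stay without a firmware node and the parent's node is intact afterwards. The obvious caveat of doing this in a real driver is that anything else looking at the parent's fwnode during that window would see NULL, which is one reason I considered it a workaround rather than a proper fix.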

Built, rebooted. And immediately got another kernel oops. Great.

Problem 3: I2C controller crash

I booted back into 6.19 and pulled up the logs from the failed boot with journalctl -b -1 -k. The good news buried in there: the modalias fix had worked, amd_isp4_capture was autoloading. The bad news was a fresh NULL pointer dereference right after:

BUG: kernel NULL pointer dereference, address: 0000000000000204
RIP: 0010:regmap_read+0x9/0x70

Call trace:

regmap_read
__i2c_dw_disable
i2c_dw_init
amd_isp_dw_i2c_plat_runtime_resume
genpd_runtime_resume
__rpm_callback
rpm_resume
__pm_runtime_resume
amd_isp_dw_i2c_plat_probe

Root cause

The I2C driver's probe function does this:

static int amd_isp_dw_i2c_plat_probe(struct platform_device *pdev)
{
    // ... setup ...

    pm_runtime_enable(&pdev->dev);
    pm_runtime_get_sync(&pdev->dev);       // triggers runtime resume!

    ret = i2c_dw_probe(isp_i2c_dev);       // regmap created here via i2c_dw_init_regmap()
}

pm_runtime_get_sync() immediately triggers a "resume", which calls amd_isp_dw_i2c_plat_runtime_resume(). That function calls i2c_dw_init(i_dev), which tries to access the I2C controller's register map. But the regmap doesn't exist yet; it gets created later in i2c_dw_probe() via i2c_dw_init_regmap(). The resume callback assumes the hardware has already been fully initialized, which is true after a real suspend, but not during the very first probe.

Fix: guard the resume path.

i2c_dw_prepare_clk(i_dev, true);

if (!i_dev->map)
    return 0;

i2c_dw_init(i_dev);

If the regmap hasn't been created yet, skip re-initialization. The real setup happens moments later in i2c_dw_probe() anyway.
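
The ordering problem is easier to see in a small user-space model. This is only a sketch of the sequence described above, not the driver: struct i2c_model, struct regmap and the model_* functions are hypothetical stand-ins, and the assumption that probe performs the hardware init right after creating the regmap is a simplification.

```c
#include <assert.h>
#include <stddef.h>

/* Reduced stand-in for a register map. */
struct regmap {
    unsigned int enable_reg;
};

/* Reduced stand-in for the I2C controller state. */
struct i2c_model {
    struct regmap *map; /* NULL until probe creates it */
    int clk_enabled;
    int init_count;     /* how often the hardware init ran */
};

/* Models i2c_dw_init(): touches registers, so it needs a regmap. In the
 * kernel, reaching this with map == NULL is the NULL pointer dereference. */
static void model_hw_init(struct i2c_model *d)
{
    assert(d->map != NULL);
    d->map->enable_reg = 0;
    d->init_count++;
}

/* Models the runtime resume callback with the guard applied. */
int model_runtime_resume(struct i2c_model *d)
{
    d->clk_enabled = 1;

    /* First resume during probe: no regmap yet, nothing to re-initialize. */
    if (!d->map)
        return 0;

    model_hw_init(d);
    return 0;
}

/* Models the probe ordering: the runtime resume fires before the regmap
 * exists, then the regmap is created and the real init runs. */
int model_probe(struct i2c_model *d, struct regmap *storage)
{
    int ret = model_runtime_resume(d); /* stands in for pm_runtime_get_sync() */

    if (ret)
        return ret;

    d->map = storage;                  /* stands in for i2c_dw_init_regmap() */
    model_hw_init(d);
    return 0;
}
```

Running model_probe() on a fresh device initializes the hardware exactly once, from the probe path. A later model_runtime_resume() finds the regmap in place and re-initializes normally, which is the behavior you want after a real suspend.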

The result

With all patches applied, the webcam works correctly on 7.0-rc2. Module autoloading works, there are no crashes, and s2idle suspend/resume is solid.

I'll admit, I was nervous posting these workarounds. I have little experience with kernel development and kept second-guessing whether I was making rookie mistakes. But the AMD developer investigating the bug on the GitLab issue responded, confirmed the cause, and is working on a proper upstream fix. Let's hope this lands in rc4.

I'm now running 7.0-rc2 as my daily driver. s2idle is excellent, and the NPU works beautifully with FastFlowLM.