Bisecting Fedora kernel

This post shows how to bisect a Fedora kernel to find the source of a regression. I needed to do that recently and found no good guide, so I’m capturing my notes here; perhaps you’ll find them useful. This approach lets you identify the exact commit that caused bad kernel behavior on your hardware, so that you can report it to the kernel maintainers. Note that you need a reliable way to reproduce the problem. If it happens randomly and infrequently, it’s much harder to debug.

0. Try the latest Rawhide kernel

Before you spend too much time on this, it’s always worth a shot to test the latest Rawhide kernel. Perhaps the bug is fixed already?

Usually the kernel consists of these installed packages: kernel, kernel-core, kernel-modules, kernel-modules-core, kernel-modules-extra. But check what you actually have installed on your system, e.g. with: rpm -qa | grep ^kernel | sort

Install the latest Rawhide kernel:

sudo dnf update --setopt=installonly_limit=0 --repo fedora --releasever rawhide kernel{,-core,-modules,-modules-core,-modules-extra}

You want to use --setopt=installonly_limit=0 throughout this exercise to make sure you don’t accidentally remove a working kernel from your system and don’t end up with just broken ones (there’s a limit of three kernels installed at the same time by default). But it means you’ll need to remove tested kernels manually from time to time, otherwise you run out of space in /boot.
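Because kernels pile up with installonly_limit=0, it can help to check the free space in /boot before each install. A minimal sketch, assuming a POSIX df; the helper name free_kib is made up for this example:

```shell
# Print the free space, in KiB, of the filesystem that holds the given path.
# Relies on the fixed column layout of "df -P" (4th column = available).
free_kib() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}
```

Run it as free_kib /boot and remove an already tested kernel when the number gets low.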

Reboot and keep pressing F8 during startup to display the GRUB boot menu. Make sure to select the newly installed kernel, boot it, test it. Note down whether it’s good or bad. If the problem is still there, we’ll need to continue debugging.

Note: You can’t remove a kernel while you’re running it, so boot a different one first. Then use the standard dnf remove to get rid of it, or use dnf history for a more convenient way (e.g. dnf history undo last, which reverts the most recent transaction).
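To see which installed kernels are safe to remove, you can filter out the one you’re running. This is only a sketch: removable_kernels is a hypothetical helper, and it assumes that rpm -q kernel-core prints names of the form kernel-core-VERSION, where VERSION matches the output of uname -r.

```shell
# Read installed kernel-core package names on stdin and print those that do
# NOT belong to the running kernel release passed as the first argument.
removable_kernels() {
    grep -v -F -x "kernel-core-$1"
}
```

Usage: rpm -q kernel-core | removable_kernels "$(uname -r)"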

I. Narrow down the issue in Fedora-packaged kernels

As the first step, it’s useful to figure out which Fedora-packaged kernel is the last one with good behavior (a “good kernel”), and which one is the first one with bad behavior (a “bad kernel”). That will help you narrow down the scope. It’s much faster to download and install already built kernels than to compile your own (which we’ll do later).

You’re most probably running a bad kernel right now (because you’re reading this). So reboot, display the GRUB boot menu and boot an older kernel. See if it’s good or bad, and note it down. Unless the problem is very recent, all available kernels (usually three) in the GRUB menu will be bad, and it’s time to start downloading older kernels from Koji. Use a reasonable strategy, e.g. install a kernel that’s a month or several months old, then gradually halve the intervals until you find the latest good kernel. You don’t need to worry about using kernels from other Fedora releases (visible in their .fcNN suffix); they are standalone and work in any release. You can download the kernel subpackages manually, or use the koji command (from the koji package), e.g.:

koji download-build --arch x86_64 kernel-6.5.0-0.rc6.43.fc39

That downloads many more subpackages than you need, so install just those needed (see the previous section), e.g. like this:

sudo dnf --setopt=installonly_limit=0 install ./kernel{,-core,-modules,-modules-core,-modules-extra}-6.5*.rpm

For each picked kernel, install it, boot into it, test it, note down whether it’s good or bad. Continue until you’ve found the latest good packaged kernel and the first bad packaged kernel.
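The halving strategy can be made mechanical. Suppose you keep the candidate builds between your known-good and known-bad kernel in a text file, one per line, oldest first (the file name candidates.txt and the helper middle_build are assumptions for this sketch). Test the middle one; if it’s good, delete it and everything older from the file, and if it’s bad, delete everything newer. Each test roughly halves the list.

```shell
# Print the middle line of a list read from stdin (oldest build first).
# For an even number of lines, the older of the two middle entries is chosen.
middle_build() {
    awk '{ line[NR] = $0 } END { if (NR > 0) print line[int((NR + 1) / 2)] }'
}
```

Usage: middle_build < candidates.txt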

II. Find git commits used for building identified good and bad kernels

Now that you have the closest good and bad packaged kernel, we need to figure out which git commits from the upstream Linux kernel were used to build them. In some cases, the git commit hash is included directly in the RPM filename. For example in my case, I reported that kernel-6.4.0-0.rc0.20230427git6e98b09da931.5.fc39 is the last good kernel, and kernel-6.4.0-0.rc0.20230428git33afd4b76393.7.fc39 is the first bad kernel. From those filenames, you can see that git commit 6e98b09da931 is good and git commit 33afd4b76393 is bad.
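If you have many build names to go through, the hash can be pulled out with sed. A small sketch; nvr_commit is a hypothetical helper name, and it assumes the hash always follows the literal string git and is terminated by a dot, as in the examples above:

```shell
# Print the git commit hash embedded in a kernel build name.
# Prints nothing and returns non-zero when the name has no "git<hash>." part.
nvr_commit() {
    commit=$(printf '%s\n' "$1" | sed -n 's/.*git\([0-9a-f]\{7,40\}\)\..*/\1/p')
    [ -n "$commit" ] && printf '%s\n' "$commit"
}
```

Usage: nvr_commit kernel-6.4.0-0.rc0.20230427git6e98b09da931.5.fc39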

The commit hash isn’t always part of the filename, as with kernel-6.5.0-0.rc6.43.fc39. In that case, you need to download the .src.rpm file from that build, either manually from Koji or using:

koji download-build --arch src kernel-6.5.0-0.rc6.43.fc39

Unpack that .src.rpm (my favorite decompress tool is deco), find the linux-*.tar.xz archive and run the following command (adjust the archive filename):

$ xzcat -qq linux-6.5-rc6.tar.xz | git get-tar-commit-id
2ccdd1b13c591d306f0401d98dedc4bdcd02b421

This command is documented in the kernel.spec file, which is also in that directory. Now you know the git commit hash used for that kernel build. Figure out the commits for both the good and the bad kernel you identified.

III. Use git bisect to find the exact commit that broke it

It’s time to clone the upstream Linux kernel repo:

git clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git ~/src/linux

And also the Fedora distgit kernel repo:

fedpkg clone -a kernel ~/distgit/kernel

We’ll now use git bisect to arrive at the breaking commit which caused the problem. After each step, we’ll need to build the kernel, test it, and mark it as good or bad. Let’s start:

cd ~/src/linux
git bisect start
git bisect good YOUR_GOOD_COMMIT
git bisect bad YOUR_BAD_COMMIT

Git now prints a commit hash to be tested (and switches the repository to that commit), and an estimate of how many steps remain. We now need to take the current contents of the source code and build our own kernel.
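Since every round halves the number of candidate commits, the number of rounds is roughly the base-2 logarithm of the number of commits between your good and bad commit. A rough sketch to estimate it yourself (bisect_rounds is a made-up helper name; git’s own estimate may differ slightly):

```shell
# Rough number of bisect rounds for n candidate commits: ceil(log2(n)),
# computed by halving (rounding up) until a single commit remains.
bisect_rounds() {
    n=$1
    rounds=0
    while [ "$n" -gt 1 ]; do
        n=$(( (n + 1) / 2 ))
        rounds=$(( rounds + 1 ))
    done
    printf '%s\n' "$rounds"
}
```

Usage: bisect_rounds "$(git rev-list --count YOUR_GOOD_COMMIT..YOUR_BAD_COMMIT)". For example, 2000 commits need about 11 rounds; multiply that by your build time to see what you’re in for.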

Note: When building the kernel, I was advised to avoid the overhead of packaging, to speed up the process. I’m sure it’s good advice, but I didn’t find a good guide on how to do that (including how to retrieve the Fedora kernel config, build the kernel manually, copy it to the right places, create initramfs, create a boot option in GRUB, etc). So I just ran the whole process including packaging. On my machine, the compilation took about 40 minutes and packaging took 10 minutes, and I needed to do about 11 rounds, so it was an OK tradeoff for me. (If you can write a guide on how to do that without packaging, please do and link it in the comments, I’d love to read it.)

Let’s create a tarball of the current source code like this:

git archive --prefix=linux-local/ HEAD | xz -0 -T0 > linux-local.tar.xz

Usually the tarballs have a version number in both the filename and the included directory (which is then also matched in the spec file). You can do that if you wish, but I didn’t want to spend too much time on throwaway builds, so I just used a static filename and overwrote it each time.

Let’s move the tarball to the distgit repo:

mv ~/src/linux/linux-local.tar.xz ~/distgit/kernel/

Now we need to adjust the distgit spec file a bit:

cd ~/distgit/kernel
# edit kernel.spec

I made the following changes to the spec file:

-# define buildid .local
+%define buildid .local
-%define specrpmversion 6.4.9
+%define specrpmversion 6.4.0
-%define specversion 6.4.9
+%define specversion 6.4.0
-%define tarfile_release 6.4.9
+%define tarfile_release local
-%define specrelease 200%{?buildid}%{?dist}
+%define specrelease 0.gitYOUR_TESTED_COMMIT%{?buildid}%{?dist}
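Of these changes, only the specrelease line differs from one bisect round to the next, so that edit can be scripted. A minimal sketch, assuming the line always starts with %define specrelease as in the diff above; set_specrelease is a hypothetical helper, not part of fedpkg:

```shell
# Rewrite the "%define specrelease" line of a spec file so that the release
# contains the given commit hash. All other lines are left untouched.
set_specrelease() {
    spec=$1
    commit=$2
    tmp=$(mktemp) || return 1
    # Write the edited copy to a temp file first, then copy it back, so the
    # original file keeps its permissions and is never left half-written by sed.
    sed "s/^%define specrelease .*/%define specrelease 0.git${commit}%{?buildid}%{?dist}/" "$spec" > "$tmp" &&
        cat "$tmp" > "$spec"
    status=$?
    rm -f "$tmp"
    return "$status"
}
```

Usage: set_specrelease ~/distgit/kernel/kernel.spec YOUR_TESTED_COMMIT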

Now we can start the build:

nice fedpkg mockbuild --with baseonly --with vanilla --without debuginfo

Options --with baseonly and --without debuginfo make sure we don’t build unnecessary stuff. --with vanilla was needed, because Fedora-specific patches didn’t apply to the older source code.

After a long time, your results should be available in results_kernel/ and look something like this:

$ ls -1 results_kernel/6.4.0/0.git6e98b09da931.local.fc38/
build.log
hw_info.log
installed_pkgs.log
kernel-6.4.0-0.git6e98b09da931.local.fc38.src.rpm
kernel-6.4.0-0.git6e98b09da931.local.fc38.x86_64.rpm
kernel-core-6.4.0-0.git6e98b09da931.local.fc38.x86_64.rpm
kernel-devel-6.4.0-0.git6e98b09da931.local.fc38.x86_64.rpm
kernel-devel-matched-6.4.0-0.git6e98b09da931.local.fc38.x86_64.rpm
kernel-modules-6.4.0-0.git6e98b09da931.local.fc38.x86_64.rpm
kernel-modules-core-6.4.0-0.git6e98b09da931.local.fc38.x86_64.rpm
kernel-modules-extra-6.4.0-0.git6e98b09da931.local.fc38.x86_64.rpm
kernel-modules-internal-6.4.0-0.git6e98b09da931.local.fc38.x86_64.rpm
kernel-uki-virt-6.4.0-0.git6e98b09da931.local.fc38.x86_64.rpm
root.log
state.log

Notice that all the RPMs carry the git commit hash identifier that you specified in the spec file. Now you just need to install the kernel (see section I), boot it (make sure to display the GRUB menu and verify that the correct kernel is selected), and test it.

Note: If you have Secure Boot enabled, you’ll need to disable it in order to boot your own kernel (or figure out how to sign it yourself). Don’t forget to re-enable it once this is all over.

Once you’ve determined whether this kernel is good or bad, tell it to git bisect:

cd ~/src/linux
git bisect good   # or bad

And now the whole cycle repeats. Create a new archive using git archive, move it to the distgit directory, adjust the specrelease field in kernel.spec to match the new commit hash, and use fedpkg to build another kernel. Eventually, git bisect will print out the exact commit that caused the problem.

IV. Report your findings

Report the problem and the identified breaking commit in Red Hat Bugzilla under the kernel component. Please also save and attach the bisect log:

cd ~/src/linux
git bisect log > git-bisect-log.txt

Then also report this problem (possibly a regression) to the kernel upstream and mention it in the RH Bugzilla ticket. Thanks and good luck.
