NFS ‘fix’ in kernel costs me a day

Migrated: originally published 2007-11-05

Linux updates are riskier than I thought. This one cost me a day of work, plus about half a day for one of our developers.

Symptoms:

A developer on one of our developer boxes couldn’t do builds anymore. When he logged in, he couldn’t see the contents of the toolchain folder.

It turned out that when I updated Linux, I had unknowingly updated the kernel, and the new kernel (2.6.22.something) had an NFS fix in it. The problem being fixed was as follows. If you have two NFS mounts from the same partition on a server, with different options, you could end up being able to do something to a file or folder on one mount that you are not allowed to do on the other mount. This can apparently cause NFS problems, not just security problems, so the newer kernels don’t allow you to mount two NFS shares from one host at once if they have different options, like read-write on one and read-only on the other. However, they do provide an option you can set in the mount options to tell the kernel to ignore the differences.
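To see which boxes are exposed before updating them, something like the following could help. It is only a rough sketch: it reads lines in /proc/mounts format and prints every NFS server that has both a read-only and a read-write mount. It groups by server name, not by partition, so it can report a server whose mounts would not actually conflict, and it does not handle IPv6 addresses. The function name find_mixed_nfs_mounts is made up for this example. The nfs man page lists nosharecache as a mount option for this situation; I am assuming that is the option meant above.

```shell
#!/bin/sh
# Sketch: list NFS servers that have both "ro" and "rw" mounts.
# Input on stdin is in /proc/mounts format:
#   device mountpoint fstype options dump pass
# find_mixed_nfs_mounts is a hypothetical helper name, not a standard tool.
find_mixed_nfs_mounts() {
    awk '
        $3 == "nfs" || $3 == "nfs4" {
            split($1, dev, ":")            # dev[1] is the server name
            n = split($4, opts, ",")       # options are comma separated
            for (i = 1; i <= n; i++) {
                if (opts[i] == "ro") ro[dev[1]] = 1
                if (opts[i] == "rw") rw[dev[1]] = 1
            }
        }
        END {
            # Report servers seen with both ro and rw mounts.
            for (s in ro) if (s in rw) print s
        }
    ' | sort
}

# Assumed workaround from "man nfs" (not applied in this post):
# add "nosharecache" to the options of the conflicting mounts.
```

On a box that still runs the old kernel, it would be used as: find_mixed_nfs_mounts < /proc/mounts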

The solution I actually used was to set both NFS mounts to read-write and depend on the “ro” option on the server to keep people from writing to the tool chains. I did not use the new option because the mount options for automounted shares are provided by NIS from the server, so they are the same on several boxes, and the new option would probably break the mount command on the older boxes.

So that’s what I eventually found out. But first, picture me unmounting one share, then mounting the other, and not able to mount the first one. Then picture me unmounting the second one, successfully mounting the first one, and now unable to mount the second one. Now consider that a few weeks ago I had been making changes to configuration options for “autofs” (auto-mount), and testing them by restarting the individual processes. The machine had not been rebooted for 6 months, so I thought the reboot after the update had brought out some error in the configuration. Yup, that’s what cost me most of my debug time.

I finally looked at the updates for possible problems, using the rpm option “--last” to sort the updates by when they were installed. I actually looked for “kernel” because I had found out by then that autofs and nfs are both provided by the kernel.
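A small filter makes the “look for kernel” step repeatable. This is a sketch that assumes each line of the rpm output starts with the package name, version and release; the function name kernel_updates is made up for this example.

```shell
#!/bin/sh
# Sketch: keep only kernel packages (kernel, kernel-devel, ...) from
# the output of "rpm -qa --last", which lists the newest installs first.
# kernel_updates is a hypothetical helper name.
kernel_updates() {
    grep '^kernel-'
}

# Typical use on an RPM-based box:
#   rpm -qa --last | kernel_updates | head
```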

After I looked for something like my problem in the last month of emails and in the Red Hat Bugzilla, I filed a bug report. And got a one line response, in about an hour. That’s when I found out about the new NFS setting, from which I found out about the kernel NFS fix above, when I read “man nfs”.

Lessons learned:
1) “rpm” option: “--last”. Actually, I have used this before when investigating recent updates. The command I used is:

rpm -qa --last | less

2) Don’t believe everything you read in a man page. “man nfs” says that the fix was made in 2.6.18. The old kernel on the box was 2.6.20.something before the update, and it never complained about the differences in options. One possible explanation is that Red Hat sometimes doesn’t take certain fixes when they first come out.

3) I shoulda oughta really shoulda stopped yum from downloading kernel updates automatically. The fix turns out to be easy – edit “/etc/yum.conf” and uncomment the line that looks something like “# exclude=kernel kernel-devel”. I will fix that on the other boxes right away.

4) Having the developers use basically interchangeable Linux development boxes that mount the tool chains and the developer workspace from a central server makes it relatively easy to move them off a box that has problems.

5) I will subscribe to more mailing lists, like nfs or autofs, but not the high-traffic developer lists. I already subscribe to the Bugzilla, Buildbot and Amanda lists.

6) And the really tough one. Make the developer boxes all run the same distro, one that provides several years of updates, like RHEL, CentOS or Debian, but not Fedora Core. (Actually our office is limited to Red Hat type distros for technical reasons, so Debian is out.) Set up some tests to make sure that “everything” is working on a developer box. And test updates on one box before rolling them out to the others. Preferably test the updates one at a time.
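As a starting point for the tests in lesson 6, here is a minimal sketch of a check for the exact symptom that started all this: a directory that is missing or shows up empty. The function name check_dir_nonempty and the path in the usage line are made up for this example; the real toolchain and workspace paths would go there.

```shell
#!/bin/sh
# Sketch: fail if a directory is missing or looks empty.
# Listing the directory also gives autofs the chance to mount it.
# check_dir_nonempty is a hypothetical helper name.
check_dir_nonempty() {
    dir=$1
    if [ ! -d "$dir" ]; then
        echo "FAIL: $dir is not a directory" >&2
        return 1
    fi
    # "ls -A" lists everything except . and .. so empty output
    # means the directory has no contents.
    if [ -z "$(ls -A "$dir" 2>/dev/null)" ]; then
        echo "FAIL: $dir is empty" >&2
        return 1
    fi
    echo "ok: $dir"
}
```

After an update and reboot on the test box, it could be run as: check_dir_nonempty /path/to/toolchain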
