Clearing NFS Hang-ups on the MHPCC Clusters

We occasionally see issues where the NFS server on one machine and the client on another develop communication problems, causing various kinds of hang-ups and glitches. This page discusses the types of issues and how to recover from them.

Hung 'df'

Occasionally, on the NFS client, the command 'df' will hang on one of the entries. Eventually it will time out, reporting the server that caused the problem. In this case, it is usually possible to clear the problem on the client by unmounting the stuck NFS mount point. To find the mount point of interest without waiting for the NFS timeout, follow these simple steps:

  • issue 'df' and hit ctrl-c when it hangs
  • cat /etc/mtab and look for the first nfs entry not listed by your 'df' -- that is your problem mount point
  • umount the filesystem (e.g., 'umount /data/ipp009.0')
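The comparison in the second step can be scripted. The sketch below uses illustrative sample data (the ipp008/ipp009 mount points) in place of the live 'df' output and /etc/mtab; the function name find_stuck is an assumption, not an existing tool.

```shell
#!/bin/sh
# Sketch: given the mount points that 'df' managed to report before it
# hung, and the NFS lines from /etc/mtab, print the first mount point
# that df missed -- the likely culprit.

find_stuck() {
    df_mounts="$1"   # space-separated mount points df printed
    mtab="$2"        # mtab-style lines: device mountpoint fstype
    echo "$mtab" | while read dev mnt fstype; do
        case " $df_mounts " in
            *" $mnt "*) ;;              # df listed it: healthy
            *) echo "$mnt"; break ;;    # first one df missed
        esac
    done
}

# Example: df got through / , /boot , and /data/ipp008.0 , then hung.
find_stuck "/ /boot /data/ipp008.0" \
"ipp008:/export/ipp008.0 /data/ipp008.0 nfs
ipp009:/export/ipp009.0 /data/ipp009.0 nfs"
```

On a live client, the first argument would come from the partial 'df' output and the second from /etc/mtab.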

Ghost automount entries

Occasionally, the NFS client will have entries in the automount directory that are not listed by 'df'. This can happen without causing problems if the nfs entry was somehow misplaced in /etc/mtab. However, sometimes the directory will hang on any interaction. In these cases, a simple umount of the filesystem usually does not fix the problem. Sometimes 'umount -f' may fix it, but my success with this has been sporadic. The following sequence seems to be generally successful (though sometimes it takes 2 or 3 tries). In this example, the NFS partition is exported as ipp009:/export/ipp009.0 and is automounted at /data/ipp009.0.

mkdir -p /tmp/foo

# force-unmount the hung automount entry, then mount and cleanly
# unmount the same export at a scratch location to reset the client state
umount -f -v /data/ipp009.0
mount ipp009:/export/ipp009.0 /tmp/foo
umount /tmp/foo

ls /data/ipp009.0

(Note that the final 'ls' only demonstrates that the process worked.)
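Since the sequence sometimes needs 2 or 3 passes, it can be wrapped in a small retry loop. This is only a sketch; it reuses the example names from above (ipp009, /data/ipp009.0, /tmp/foo), and the variable names are my own. It must be run as root on the affected client.

```shell
#!/bin/sh
# Sketch: retry the force-unmount / remount trick until the ghost
# automount entry answers an 'ls' again. All names are the example
# values from the text, not fixed conventions.
STUCK=/data/ipp009.0
EXPORT=ipp009:/export/ipp009.0
SCRATCH=/tmp/foo

mkdir -p "$SCRATCH"
for try in 1 2 3; do
    umount -f -v "$STUCK"
    mount "$EXPORT" "$SCRATCH"
    umount "$SCRATCH"
    # if the directory answers an 'ls' again, we are done
    if ls "$STUCK" >/dev/null 2>&1; then
        echo "cleared on try $try"
        break
    fi
done
```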

NFS hang-ups and machine behavior

The image below shows the 1-minute load reported by Ganglia for (part of) the MHPCC cluster over a certain period. During this period, a number of machines had NFS problems with /data/ipp009.0. Those same machines are the ones whose load shows a regular 10-minute spike from low to high levels. It appears that the existing NFS check script triggers these load spikes by interacting with /data/ while some of the entries there are causing NFS problems.
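One way to keep a periodic check from piling load onto a machine is to bound each probe with a hard timeout, so a hung mount fails fast instead of blocking in the kernel. A sketch, assuming the GNU coreutils 'timeout' utility; the function name check_mounts is hypothetical, and on a real client the argument would be /data.

```shell
#!/bin/sh
# Sketch: probe each mount point under the given directory, capping
# each probe at 5 seconds so a hung NFS mount cannot stall the check.
check_mounts() {
    for mnt in "$1"/*; do
        # 'stat' on a healthy mount returns quickly; on a hung one it
        # blocks, so let 'timeout' kill it after 5 seconds
        if timeout 5 stat "$mnt" >/dev/null 2>&1; then
            echo "ok:   $mnt"
        else
            echo "HUNG: $mnt"
        fi
    done
}

# on a real client this would be: check_mounts /data
```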