I had an interesting issue come up today. I was asked to help with a Solaris 10 system that failed to come up after a reboot, or rather, was unreachable remotely after a reboot. The host answered pings, but ssh did not respond. Fortunately I was able to run a console cable to a laptop and take a look at what was going on. Listing the services and grepping for ssh
# svcs | grep ssh
showed that ssh had failed to come online. I tried to restart it, without success, and got no messages explaining why.
# svcadm restart ssh
Doing a check of the dependencies
# svcs -d ssh
and a detailed check on the service
# svcs -xv ssh
showed that the filesystem/local:default service was failing to come up. Hmmmm, the output of df -k seemed OK...
So I did the next logical check, this time on the filesystem/local:default service:
# svcs -d filesystem/local:default
# svcs -xv filesystem/local:default
It didn’t report any obvious causes but there was the suggestion to go look in the service startup log file – which I did
# view /var/svc/log/system-filesystem-local:default.log
and the entries at the end of the file showed failed attempts to mount several (database) partitions due to errors in the vfstab file.
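For reference, the log file name follows from the service name. As far as I can tell, SMF drops the leading svc:/ from the FMRI, turns the slashes into dashes, and appends .log under /var/svc/log. Here is a minimal sketch of that mapping, assuming the default log directory and a POSIX-style shell (ksh or bash rather than the old /bin/sh). The name smf_log_path is just a helper I made up for illustration:

```shell
# Sketch: derive the SMF log file path from a full service FMRI.
# Assumes the default log directory /var/svc/log and the usual naming
# (slashes in the service name become dashes, then ":instance.log").
# smf_log_path is a hypothetical helper, not part of Solaris.
smf_log_path() {
    fmri=${1#svc:/}                     # drop the optional "svc:/" prefix
    case $fmri in
        *:*) ;;                         # instance already given
        *)   fmri="$fmri:default" ;;    # otherwise assume the "default" instance
    esac
    printf '/var/svc/log/%s.log\n' "$(printf '%s' "$fmri" | tr '/' '-')"
}
```

With that in place, something like tail "$(smf_log_path svc:/system/filesystem/local:default)" gets you to the right file. Note that it needs the full service name (system/filesystem/local), not the abbreviated form that svcs and svcadm accept.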
This gave me the clue I needed – the vfstab file had some bogus options on each of the database mount points listed. I commented out all the bogus lines and rebooted the server… that was easier than going through all the svcadm disable and enable steps …
The server came up and ssh was back online! We fixed the vfstab entries properly and everything was back in order.
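A rough pre-reboot sanity check would have caught at least some of this. The sketch below only verifies that each non-comment vfstab line has the seven whitespace-separated fields the format expects, so it catches structural slips (a stray space inside the options list, a missing column) but not an invalid option name. It assumes a POSIX awk (nawk on Solaris, I believe, rather than the old /usr/bin/awk), and check_vfstab is a name I picked for illustration:

```shell
# Sketch: flag vfstab lines that do not have exactly 7 fields.
# This is a structural check only; it does NOT validate the mount options
# themselves. check_vfstab is a hypothetical helper, not a Solaris tool.
check_vfstab() {
    awk '
        /^[ \t]*#/ || NF == 0 { next }   # skip comments and blank lines
        NF != 7 {
            printf "line %d: expected 7 fields, found %d\n", NR, NF
            bad = 1
        }
        END { exit bad }                 # non-zero exit if any line was flagged
    ' "${1:-/etc/vfstab}"
}
```

Run with no argument it reads /etc/vfstab; pass a path to check an edited copy before moving it into place. It exits non-zero if any line was flagged, so it can gate a script.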
Now, I have a problem with all of this. In my mind this is a major weakness of SMF in Solaris 10. An improper entry in the vfstab for *any* mount point, even a non-essential one, can wedge remote administrative access, leaving you with no choice but console access (which everyone knows you *should* have anyway). This seems like a weakness in the design. Am I wrong here?