
OVH rescue - kernel panic


RichardWnl
20-08-2013, 12:55
Quote Originally Posted by yonatan
On its face, it seems your RAID array, which holds your root device, is not up at boot.

Your /dev/sda drive is actually toast, judging from the last output you provided.
It is a long shot, but if no actual damage was done to the array you might be able to reassemble it.
Perhaps you don't have your root device UUID configured in your boot loader, and the initrd might not have been updated correctly. You might want to look into the GRUB documentation on how to set the kernel boot options.

If you do have backups and you are able to reformat, first send a ticket from your manager interface with the output of smartctl and have that faulty drive replaced.

I would advise against RAID 0 for your root device: reformat your machine with a RAID mirror for your root device, as davidhogan suggested.

Or wait until next week to spin up a new mSP and load backups on a fresh server...
Yes, I'm not using the UUID:
cat /boot/grub/grub.conf
title CentOS (2.6.18-348.12.1.el5.centos.plus)
kernel /vmlinuz-2.6.18-348.12.1.el5.centos.plus root=/dev/md2 ro
root (hd0,0)
Same story for fstab:
cat /etc/fstab
#
/dev/md2 / ext3 errors=remount-ro,defaults,usrquota 0 0
/dev/md1 /boot ext3 errors=remount-ro 0 1
/dev/sda3 swap swap defaults 0 0
/dev/sdb3 swap swap defaults 0 0
proc /proc proc defaults 0 0
sysfs /sys sysfs defaults 0 0
tmpfs /dev/shm tmpfs defaults 0 0
devpts /dev/pts devpts defaults 0 0
/usr/tmpDSK /tmp ext3 defaults,noauto 0 0
I guess I could give that a try, didn't know it could have such an impact.
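
For anyone following along, a quick way to see which fstab entries still depend on kernel device names is to filter for sources that start with /dev/. This is only a rough sketch: `raw_device_entries` is a made-up helper name, and it reads fstab text on stdin so it can be tried on a copy first.

```shell
#!/bin/sh
# List fstab entries whose source is a raw /dev/ path rather than UUID= or LABEL=.
# raw_device_entries is a hypothetical helper; it reads fstab text on stdin.
raw_device_entries() {
    awk '
        /^[[:space:]]*#/ { next }          # skip comment lines
        NF < 2           { next }          # skip blank or malformed lines
        $1 ~ /^\/dev\//  { print $1, $2 }  # print device and mount point
    '
}

# Example: raw_device_entries < /etc/fstab
```

Running `blkid /dev/md2` on an assembled array should print the filesystem UUID for such an entry. Whether this particular kernel and initrd accept `root=UUID=...` on the GRUB kernel line is an assumption that would need checking before relying on it.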

And looking back, only putting /boot in RAID 1 was dumb. But at the time I had some problems with the root device filling up on occasion. I will make sure it doesn't happen again.

Right now I just want to get things back up and running asap. After that I guess I can slowly migrate to the mSP.

yonatan
20-08-2013, 12:08
Quote Originally Posted by RichardWnl
Hmm, some more information from while the server is booting.

This caught my attention:
RAID0: too few disks (1 of 2) - aborting
Looking inside /var/log/messages, I found this error, logged before the server stopped working:
Aug 15 21:23:54 ovh smartd[6149]: Device: /dev/sda [SAT], 537 Currently unreadable (pending) sectors
Aug 15 21:23:54 ovh smartd[6149]: Device: /dev/sda [SAT], 537 Offline uncorrectable sectors
So I'm now forcing the drive to run a long SMART self-test:
smartctl -t long /dev/sda
But for some reason I keep thinking about the missing modules yonatan was talking about. I've tried rebuilding the initrd image, with no luck so far. I also compiled a new kernel, but that doesn't seem to do the trick either.
On its face, it seems your RAID array, which holds your root device, is not up at boot.

Your /dev/sda drive is actually toast, judging from the last output you provided.
It is a long shot, but if no actual damage was done to the array you might be able to reassemble it.
Perhaps you don't have your root device UUID configured in your boot loader, and the initrd might not have been updated correctly. You might want to look into the GRUB documentation on how to set the kernel boot options.

If you do have backups and you are able to reformat, first send a ticket from your manager interface with the output of smartctl and have that faulty drive replaced.

I would advise against RAID 0 for your root device: reformat your machine with a RAID mirror for your root device, as davidhogan suggested.

Or wait until next week to spin up a new mSP and load backups on a fresh server...
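
The reassembly attempt mentioned above could look roughly like the following from rescue mode. This is a sketch under assumptions, not a tested recipe: `plan_reassemble` is a made-up helper that only prints the mdadm commands instead of running them, and the member partitions in the example are guesses, since the thread never shows which partitions make up /dev/md2.

```shell
#!/bin/sh
# Print (not run) the mdadm steps for inspecting and reassembling an array.
# plan_reassemble is a hypothetical helper; device names are assumptions.
plan_reassemble() {
    md=$1; shift
    for part in "$@"; do
        echo "mdadm --examine $part"    # read the RAID superblock on each member
    done
    echo "mdadm --assemble $md $*"      # try to start the array from its members
    echo "cat /proc/mdstat"             # confirm the array state afterwards
}

# Example (partition names are guesses, check them first):
# plan_reassemble /dev/md2 /dev/sda2 /dev/sdb2
```

Keep in mind that a RAID 0 array cannot run with a member missing, so this can only help if both partitions are still readable.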

RichardWnl
20-08-2013, 11:48
Hmm, some more information from while the server is booting.

This caught my attention:
RAID0: too few disks (1 of 2) - aborting
Looking inside /var/log/messages, I found this error, logged before the server stopped working:
Aug 15 21:23:54 ovh smartd[6149]: Device: /dev/sda [SAT], 537 Currently unreadable (pending) sectors
Aug 15 21:23:54 ovh smartd[6149]: Device: /dev/sda [SAT], 537 Offline uncorrectable sectors
So I'm now forcing the drive to run a long SMART self-test:
smartctl -t long /dev/sda
But for some reason I keep thinking about the missing modules yonatan was talking about. I've tried rebuilding the initrd image, with no luck so far. I also compiled a new kernel, but that doesn't seem to do the trick either.
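
To pull just the sector warnings out of a large /var/log/messages, something like the filter below may help. It is a sketch that assumes log lines in the same shape as the two quoted above; `smartd_sector_warnings` is a made-up name, and it reads log text on stdin.

```shell
#!/bin/sh
# Summarise smartd sector warnings from syslog text read on stdin.
# smartd_sector_warnings is a hypothetical helper; it prints
# "<device> <count> <description>" for each matching line.
smartd_sector_warnings() {
    sed -n 's/^.*smartd\[[0-9]*]: Device: \([^ ,]*\).*, \([0-9][0-9]*\) \(.*sectors\)$/\1 \2 \3/p'
}

# Example: smartd_sector_warnings < /var/log/messages
```

Once the long test has finished, `smartctl -l selftest /dev/sda` shows its result and `smartctl -A /dev/sda` shows the attribute table, which is the sort of output a support ticket usually needs.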

davidhogan
19-08-2013, 18:39
You could also back up your data and reinstall the server: mount the drive(s) your data is stored on and use SFTP to download your files. You can do this in rescue mode.

RichardWnl
19-08-2013, 01:31
It's a 2011 EG-10R BestOF server
Specs: Intel Xeon W3520 CPU / 12 GB DDR3 RAM / 2x 1.5 TB SATA2 HDD

The server uses software RAID.


cat /proc/mdstat


I also tried to compile a new kernel with some help from here:
http://forum.ovh.co.uk/showthread.php?t=2056

Result:


As for the missing modules, I've had a look in here (the default kernel that is being used):

But I'm not really sure if I'm looking in the right directory and what it is that I should pay attention to.

yonatan
18-08-2013, 03:20
Boot up in rescue-pro and install a new kernel.

What type of server is it?
Do you have a hardware RAID card, or RAID at all?
This might be caused by missing modules at boot.
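
One way to check for missing modules is to list the initrd contents and look for the RAID and filesystem drivers. The sketch below makes several assumptions: the initrd is a gzip-compressed cpio archive, the module names (raid0, raid1, ext3, jbd) are the ones this setup needs, and `missing_modules` is a made-up helper that reads a file listing on stdin.

```shell
#!/bin/sh
# Report which of the named modules are absent from an initrd file listing.
# missing_modules is a hypothetical helper; the listing is read on stdin.
missing_modules() {
    listing=$(cat)
    for mod in "$@"; do
        # A module counts as present if some path ends in /<name>.ko
        printf '%s\n' "$listing" | grep -q "/$mod\.ko\$" || echo "$mod"
    done
}

# Example (image path is a placeholder):
# zcat /boot/initrd-VERSION.img | cpio -t | missing_modules raid0 raid1 ext3 jbd
```

If the kernel was built with these drivers compiled in rather than as loadable modules, the check does not apply and an empty initrd listing is not a problem in itself.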

RichardWnl
18-08-2013, 02:12
Since yesterday one of my servers keeps getting booted into rescue mode.

One of OVH's technicians commented:
diagnosis:
server don't boot on hard disk (kernel panic error)
not boot on bzimage
server put on rescue mode
boot ok, ping ok
hard disks ok
I have tested the hardware (CPU/RAM/HDD) and it all seems OK, no errors found. The server is running 32-bit CentOS 5 with cPanel installed on it. Since it hasn't been touched for some time, I assume cPanel updated something that made the server go down. But that's just a guess.

Tried all sorts of things to fix it with no luck. So maybe you guys have an idea where I need to look?

This is what it looks like when I start the server with vKVM (using VNC method):