News: Downtime due to Kernel Issue - Progress Report and Update
Status (UP):
10.03am
At about 10am this morning, we received a power surge alert on one of our servers. The surge forced the server to shut down unexpectedly. We have not yet been able to determine where the power surge came from, but our priority is to get the server running again. An engineer has been dispatched to the CJ1 data center.
11.30am
Our engineer has spent more than an hour trying to boot the server, checking fstab and other possible causes, but the server still fails to boot past dracut.
12.30pm
We have opened up the server to check whether there are any issues with the RAM or the motherboard.
1.30pm
After testing and reseating all 512GB of our RAM, none of it appears to be faulty. We will continue by testing the hard disks.
2.30pm
We are running xfs_repair on all hard disks. This will take some time.
4.30pm
All hard disks have now been checked and are working fine, with no superblock failures or read/write errors. Our engineer can also access them via rescue mode. We are now refocusing our effort on /etc/fstab and will try to repair the GRUB boot loader.
7.30am
GRUB is fully repaired, and /etc/fstab has been reconfigured to be more resilient. The server is up and running now. We are looking into why the power surge happened and which hardware was affected. We have also started on a backup plan to migrate the server in case the power surge happens again.