News: Downtime due to Kernel Issue - Progress Report and Update
At 10 am this morning, we received a power surge alert on one of our servers. The surge forced the server to shut down unexpectedly. We cannot yet determine where the power surge came from, but our priority is to get the server running again. An engineer has been dispatched to the CJ1 data center.
Our engineer has spent more than an hour trying to boot the server, checking all fstab entries and other possible issues, but the server still fails to boot past dracut.
We have opened up the server to determine whether there are any issues with the RAM or the motherboard.
After testing and reseating all 512 GB of RAM, no module appears to be faulty. We will move on to testing the hard disks.
We are running xfs_repair on all hard disks; this will take some time.
All hard disks have now been checked and are working fine, with no superblock failures or read/write errors. Our engineer can also access them in rescue mode. We are now refocusing our efforts on /etc/fstab and on repairing the GRUB boot loader.
GRUB is fully repaired, and /etc/fstab has been reconfigured to be more resilient. The server is now up and running. We are investigating why the power surge happened and which hardware was affected. We have also started a backup plan to migrate the server in case the power surge happens again.
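
For readers wondering what a "more resilient" /etc/fstab can look like: this update does not list the exact changes made. One common hardening, assumed here purely for illustration, is the `nofail` mount option, which lets the boot continue when a non-essential filesystem is unavailable. The Python sketch below flags non-root, non-swap entries that lack it. The sample entries, UUIDs and the function name `entries_missing_nofail` are hypothetical and are not taken from the affected server.

```python
# Minimal sketch: list fstab mount points that do not carry the `nofail` option.
# Assumption: `nofail` is the kind of hardening meant; the real change is not stated.

# Hypothetical fstab content, for illustration only.
SAMPLE_FSTAB = """\
# device     mountpoint  type  options          dump  pass
UUID=aaaa    /           xfs   defaults         0     0
UUID=bbbb    /data       xfs   defaults         0     0
UUID=cccc    /backup     xfs   defaults,nofail  0     0
UUID=dddd    none        swap  sw               0     0
"""


def entries_missing_nofail(fstab_text):
    """Return mount points of non-root, non-swap entries without `nofail`."""
    missing = []
    for line in fstab_text.splitlines():
        stripped = line.strip()
        # Skip blank lines and comments.
        if not stripped or stripped.startswith("#"):
            continue
        fields = stripped.split()
        # A usable entry needs at least: device, mount point, type, options.
        if len(fields) < 4:
            continue
        mount_point, fs_type, options = fields[1], fields[2], fields[3].split(",")
        # The root filesystem must mount for the system to boot; swap has no mount point.
        if mount_point == "/" or fs_type == "swap":
            continue
        if "nofail" not in options:
            missing.append(mount_point)
    return missing


if __name__ == "__main__":
    for mount_point in entries_missing_nofail(SAMPLE_FSTAB):
        print(f"{mount_point}: consider adding nofail")
```

To check a real system, the same function can be given the contents of /etc/fstab read from disk. Whether `nofail` is appropriate for a given mount depends on whether services can safely start without that filesystem.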