News: Downtime due to Kernel Issue - Progress Report and Update

Published: 20/10/2023

Status (UP):


10.03am


At about 10am this morning, we received a power surge alert on one of our servers. The surge forced the server to shut down unexpectedly. We cannot yet determine where the power surge came from, but our priority is to get the server running again. An engineer has been dispatched to the CJ1 data center.


11.30am


Our engineer has spent more than an hour trying to boot the server, checking the fstab entries and other possible causes, but the server still fails to boot past dracut.


 


12.30pm


We have opened up the server to determine whether there is an issue with the RAM or the motherboard.


 


1.30pm


After testing and reseating all 512 GB of RAM, we have found no faulty modules. We will now move on to testing the hard disks.


 


2.30pm


We are running xfs_repair on all hard disks. This will take some time.


 


4.30pm


All hard disks have now been checked and are working fine, with no superblock errors or read/write failures. Our engineer can also access them via rescue mode. We are now refocusing our efforts on /etc/fstab and on repairing the GRUB boot loader.


 


7.30am


GRUB is fully repaired, and /etc/fstab has been reconfigured to be more resilient. The server is now up and running. We are investigating why the power surge happened and which hardware was affected. We have also started on a backup plan to migrate the server in case the power surge happens again.
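
This update does not list the exact changes made to /etc/fstab. As an illustration only, one common way to make fstab more tolerant of a missing or damaged disk is the nofail mount option, which lets boot continue if that filesystem cannot be mounted. The Python sketch below assumes that approach and shows how the entries could be audited. The function names and the sample entries are hypothetical and are not taken from the affected server.

```python
def parse_fstab(text):
    """Parse fstab-style text into a list of dicts, skipping comments and blanks."""
    entries = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) < 4:
            continue  # malformed line: fewer than device, mount point, type, options
        entries.append({
            "device": fields[0],
            "mount_point": fields[1],
            "fs_type": fields[2],
            "options": fields[3].split(","),
        })
    return entries


def mounts_missing_nofail(entries):
    """Return mount points of non-root, non-swap entries that lack 'nofail'."""
    return [
        e["mount_point"]
        for e in entries
        if e["mount_point"] != "/"
        and e["fs_type"] != "swap"
        and "nofail" not in e["options"]
    ]


# Hypothetical sample, not the real configuration of the affected server.
SAMPLE_FSTAB = """\
# device        mount   type  options          dump pass
UUID=aaaa-1111  /       xfs   defaults         0 0
UUID=bbbb-2222  /boot   xfs   defaults         0 0
UUID=cccc-3333  /data   xfs   defaults,nofail  0 0
UUID=dddd-4444  none    swap  defaults         0 0
"""

if __name__ == "__main__":
    for mount_point in mounts_missing_nofail(parse_fstab(SAMPLE_FSTAB)):
        print(f"{mount_point}: no 'nofail' option set")
```

Whether nofail is appropriate depends on the mount: it suits secondary data volumes, while filesystems needed for boot usually should still fail loudly.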