Bricked ECS After Migration Troubleshooting

V1.0 – July 2024

Version Author Description
V1.0 – 2024-07-30 Diogo Hatz 50037923 Initial Version
V1.0 – 2024-07-30 Wisley da Silva 00830850 Document Review

Introduction

SMS is a virtual machine migration service provided by Huawei Cloud. With this service, you can migrate VMs from other cloud providers or from on-premises environments to the cloud. SMS migrates virtual machines to ECSs, which correspond to the virtual machine service in Huawei Cloud.

This document aims to present a solution for VMs migrated using the SMS migration service in which it is not possible to access the machine through “remote login”, via the console, or via remote access via protocols such as SSH.

Considerations

Important: It is possible for several different factors to cause ECSs to freeze after they are migrated via SMS. This document will address the issue of compatibility of certain versions of cloud-init with the ECS service, which is one of the factors that can cause ECS to freeze.

Symptoms

When trying to access the ECS via remote login or SSH, the following errors occur:

Temporary ECS

Since the ECS cannot be accessed, you will need to remove your system disk and attach it to a temporary ECS in order to access its boot menu. To do this, first create a temporary ECS with the same operating system and same AZ as the frozen machine. After that, remove the system disk from the frozen ECS and place it in the temporary ECS as a data disk. Removing the system disk from the frozen ECS:

Attaching the system disk from the frozen ECS to the temporary ECS as a data disk data:

After that, remotely access the temporary ECS and use the “fdisk -l” command to list the disks attached to the machine.

When you identify the disk that was attached to the temporary ECS, mount the disk with the mount command. For example: “mount /dev/vdb1 /mnt”.

Once the mount is complete, perform the following steps:

  1. Delete the grub configuration file with the command:

rm /mnt/boot/grub/grub.cfg

  1. Copy the generic kernel from the temporary ECS to the /boot directory of the frozen ECS:

cp /boot/vmlinuz-5.4.0-170-generic /mnt/boot/vmlinuz-5.4.0-170-generic

Important: The kernel name used was just an example; it is necessary to copy the kernel used by the temporary ECS. If in doubt, use the “uname -r” command to list the running kernel version.

  1. Copy the initrd from the temporary ESC to the /boot directory of the frozen ECS:

cp /boot/initrd.img-5.4.0-170-generic /mnt/boot/initrd.img-5.4.0-170-generic

Important: Copy the initrd for the kernel copied in step 2.0. If there is no initrd, generate one with the “update-initramfs -u” command.

Remove the data disk with the command “umount /dev/vdb1”.

Once done, put the frozen ECS system disk back into the original ECS, following the step-by-step instructions in item 4.0 of this document. Once done, start the machine and “remotely log in” to it via the console.

Grub shell

Run the “ls” command to list the disk partitions seen by Grub. To identify which is the correct partition to use, run the “ls (hd0,gpt1)/” command until you find the partition with the contents of the system disk, replacing “hd0,gpt1” with the partitions seen by the “ls”.

Once you find the correct partition, perform the following steps to boot the temporary ECS kernel in single-user mode:

  1. Replacing (hd0,gpt1) with the partition found above;
    set root=(hd0,gpt1)
    
  2. Replacing “vmlinuz-5.4.0-170-generic” with the kernel copied from the temporary ECS in item 4.0 of this document and replacing “vda1” according to the partition found. Example: (hd0,gp1) = vda1, (hd1,gpt1) = vdb1, (hd3,gpt2) = /dev/vdd2, and so on;
    linux /boot/vmlinuz-5.4.0-170-generic root=/dev/vda1 ro single
    
  3. Replacing “initrd.img-5.4.0-170-generic” with the initrd copied from the temporary ECS in item 4.0 of this document;
    initrd /boot/initrd.img-5.4.0-170-generic
    
  4. Finish booting.
    boot
    

After entering the boot command, the ECS will boot in single-user mode. Enter the ECS root password when prompted.

Single-user

Use the “apt-get remove cloud-init -y” command to uninstall cloud-init.

Use the “update-grub” command to generate the previously deleted grub configuration file.

Use the command “grep ‘menuentry ‘ /boot/grub/grub.cfg” to list the kernel versions on the system and copy the desired version so that Grub will boot by default.

Use the command “vim /etc/default/grub” to modify the grub configuration file. Change the parameters grub_default={kernel name copied above}, “grub_timeout_style=menu” and “grub_timeout=10”.

Use the “update-grub” command to update the grub configuration file again.

Use the “reboot” command to restart the ECS. Note that the machine will now boot normally.

Configurations

Check the ECS connectivity with the “ip a” command. If the ECS does not have the eth0 interface configured correctly, there may be a conflict in the netplan program configuration. If the ECS has normal connectivity, skip section 7.1 of this document.

Netplan

Type the command “vim /etc/netplan/50-cloud-init.yaml” to open the ECS network configuration file and add the eth0 interface as follows:

Once done, apply the settings made with the “netplan apply” command

If connectivity has not yet returned to normal, check the installation of the drivers KVM from the following documentation: https://support.huaweicloud.com/intl/en-us/usermanual-ims/ims_01_0326.html#ims_01_0326__section1865536911274.

Config

If the VM was migrated from Azure, you will need to change the machine’s Yum repositories to point to the Huawei repository:

sed -i 's/azure.archive.ubuntu.com/repo.huaweicloud.com/g' /etc/apt/sources.list

apt autoclean && apt update

Once you have changed the repositories, reinstall cloud-init with the command:

apt-get install cloud-init

Important: Do not install version 23.3.3 of cloud-init.

Install a new version of the Linux kernel with the command:

apt-get install linux-image-generic

Use the command “grep ‘menuentry ‘ /boot/grub/grub.cfg” to list the kernel versions on the system and copy the latest installed version.

Use the command “vim /etc/default/grub” to modify the grub configuration file. Change the parameters grub_default={kernel name copied above}.

Use the “update-grub” command to update the grub configuration file again.

Use the “reboot” command to restart ECS on the updated kernel.

Settings (Optional)

In addition to the above settings, it is also recommended that the Azure agent, which is installed by default on Azure VMs, be uninstalled, since the agent constantly reports logs to the VNC console, which may affect VNC performance:

Enter the following command to uninstall the Azure agent:

sudo apt -y remove walinuxagent
apt-get remove -y linux-azure-*
apt-get remove -y *azure

References