Creating and Managing Alarms
V1.1 – December 2023
Version | Author | Description |
---|---|---|
V1.0 – 2023-12-20 | Diogo Hatz 50037923 | Initial Version |
V1.0 – 2023-12-21 | Wisley da Silva 00830850 | Document Review |
Introduction
Cloud Eye (CES) is a free tool for monitoring Huawei Cloud resources. In addition to resource monitoring, Cloud Eye can also be used to create event- or metric-based alarms, identify resource malfunctions, and quickly react to resource changes. It is worth noting that, although Cloud Eye is a free service, charges for sending notifications when alarms are triggered are charged.
This document aims to describe the main functionalities of the Cloud Eye service and guide the reader to use CES for monitoring cloud resources, such as ECSs, VPNs, and CBRs, etc. In addition, it also describes how to create event- or metric-based alarms and customize dashboards for resource monitoring.
Cloud Eye on the console
Overview
When you open Cloud Eye on the console, the home page that will load is the Overview, where you can see an overview of all resources used in Huawei Cloud, the overall network, CPU, memory, and disk utilization, and which resources have recently triggered alarms and need further attention.
-
Resource Overview: Allows you to view the total number of monitored resources and the alarms generated for these resources.
-
Alarm Statistics: Shows the alarms triggered in the last seven days by alarm severity.
-
Server Monitoring: Allows you to view the overall CPU and memory utilization of monitored servers and a list of the top 5 ECSs ranked by CPU or memory utilization.
-
Network Monitoring: Shows the overall bandwidth utilization of EIPs and a list of the top 5 EIPs ranked by bandwidth utilization.
-
Storage Monitoring: Allows you to view the overall disk utilization (EVS) by read and write IOPS and a list of the top 5 disks ranked by IOPS.
You can see what the Cloud Eye home page looks like in the images below:
Resource groups
Resource groups allow you to group multiple Huawei Cloud resources for joint monitoring, and also facilitate the management of alarms for multiple resources in batches.
A resource group can be created in the Resource Groups section by clicking Create Resource Group.
On the page that loads, choose a name for the resource group in Name and select the resources to add to the group by service. After adding all the desired resources, click Create.
You can create alarms for a specific resource group, making it easy to create batch alarms for multiple resources that share the same context.
Alarm management
The alarm management section has the following subsections:
-
Alarm rules: Subsection used to view and create alarms based on metrics or events.
-
Alarm history: Subsection used to view triggered alarms.
-
Alarm template: Subsection related to viewing alarm templates.
-
One-click monitoring: Subsection that allows you to enable monitoring for common service events
-
Alarm mask: Subsection that allows you to create alarm masks so that triggered alarms are not notified.
Creating an alarm
To create an alarm for a specific resource based on an event or metric, navigate to the Alarm Rules section in Alarm Management and click Create Alarm Rule.
Configure the basic alarm settings, such as the alarm name in Name and the type of resource to be monitored in Resource Type, as well as its scope in Dimension. To configure an alarm for an ECS, for example, the Resource Type is Elastic Cloud Server and the Dimension is ECSs.
If the alarm triggering condition is a metric, such as the ECS CPU or memory utilization rate, select Metric in Alarm Type. If the alarm triggering condition is an event, for example, the event of an uninstalled GPU, select Event in the Alarm Type field. In this example, the metric to trigger the alarm will be the ECS CPU utilization above 80%.
In Monitoring Scope, the specific resource that will trigger the alarm must be configured. The resource can be selected in three different ways:
- All resources: Select this option if the alarm can be triggered by all instances of the selected resource.
- Resource groups: Select this option if the alarm can be triggered by all resources present in a resource group. See section 3.3.
- Specific resources: Select this option to choose a specific instance of the selected service to trigger the alarm.
In this example, the ECS “ecs-4194” will be selected as the instance that can trigger the alarm in the Specific resources field under Monitoring Scope.
There are three different ways to configure the metric that will trigger the alarm in Method:
-
Associate template: In this option, the metric to trigger the alarm will be configured based on an existing template.
-
Use existing template: In this option, the metric to trigger the alarm will be configured based on an existing template.
-
Configure manually: In this option, the metric to trigger the alarm will be configured manually, which allows for greater flexibility.
In this example, the metric that will be configured to trigger the alarm will be if the ECS CPU usage is greater than or equal to 80%. In Metric Name, it is possible to select the metric that can trigger the alarm, as in this case it is (Agent) CPU Usage (Recommended). For this option to be selected, the Cloud Eye agent must be installed, as done in section 3.5. It is important that the agent is installed on monitored servers to obtain better accuracy in data monitoring and a wider range of monitoring metrics.
In Alarm Policy, you can select the type of data that will be analyzed, such as raw data, average, maximum, minimum, variance or sum of the ingested data; as well as the percentage that will trigger the alarm and the form of comparison, such as greater than or equal, greater than, less than, less than or equal, increase in relation to or decrease in relation to.
In the Alarm Notification section, you can configure a notification for alarms triggered by email, SMS, HTTP and HTTPS requests or through a trigger in FunctionGraph. If the notification only needs to be sent to the email of the account owner in the Huawei Cloud console, you can select the Account contact option in Notification Object directly. In this example, a second email address will be configured to receive notifications of Cloud Eye alarms. To do this, you must first enable the Short Message Notification (SMN) service. Click Topics to view the notification topics that you have created.
Click Create Topic to create a notification topic.
Enter the name of the notification topic in Topic Name and click OK.
Next, click Add Subscription to add a communication channel through which the notification will be sent.
Next, choose the protocol for sending the notification, in this case Email, and enter the chosen email in Endpoints. Click OK.
An email will be sent to the selected endpoint. For the SMN service to work correctly, the user must confirm their email through a confirmation that will be sent as soon as the Subscription is configured.
Returning to the alarm creation, select the topic created in the previous steps in Notification Object and configure the time window in which the notification can be sent in Notification Window.
In addition, also select the occasions in which the notification will be sent in the Trigger Condition: when the alarm is generated or when the alarm is cleared. After configuring, select Create to create the alarm.
In Alarm Rules you can see the created alarms and their statuses, as well as the resource that is monitored and the alarm activation policy.
After an alarm is triggered, you can view it in the Alarm Records section in Alarm Management.
You can also view the notification generated by the alarm on the endpoint selected for sending the notification in the SMN service. In another context, the following email was generated for monitoring a bucket in the OBS service for object storage in Huawei Cloud:
The tables for the metrics and events monitored for Huawei Cloud ECS, VPN, NAT, and CBR services have been included in the appendices section 4.0 of this document. To create event-based alarms or metrics for these services, the procedure is the same as that described above.
Attachments
Server Monitoring Metrics
Metrics | No Agent | Agent Installed |
---|---|---|
CPU Usage | Yes | Yes / Dedicated |
Disk Usage | Yes | Yes |
Memory Usage | Yes | Yes / Dedicated |
Disk Write Bandwidth | Yes | Yes |
Disk Read Bandwidth | Yes | Yes |
Disk Write IOPS | Yes | Yes |
Disk Read IOPS | Yes | Yes |
In-Band Rate | Yes | Yes |
In-Band Rate | Yes | Yes |
In-Band Rate | Yes | Yes |
Out-Band Rate | Yes | Yes |
Out-Band Rate | Yes | Yes |
CPU Credit Usage | Yes | Yes |
CPU Credit Balancing | Yes | Yes |
CPU Credit Balancing Overage | Yes | Yes |
CPU Loaded Credit Overage | Yes | Yes |
Network Connections | Yes | Yes |
Inbound Bandwidth Per Server | Yes | Yes |
Outbound Bandwidth Per Server | Yes | Yes |
Inbound PPS | Yes | Yes |
Outbound PPS | Yes | Yes |
New Connections | Yes | Yes |
Aggregate ECC Uncorrectable Errors | Yes | Yes |
Pages Retired with Single Bit Errors | Yes | Yes |
Pages Retired with Double Bit Errors | Yes | Yes |
GPU Health Status | Yes | Yes |
GPU Encoder Usage | Yes | Yes |
GPU Decoder Usage | Yes | Yes |
ECC Volatile Correctable Errors | Yes | Yes |
ECC Volatile Uncorrectable Errors | Yes | Yes |
CPU Idle | No | Yes / Dedicated |
User space CPU usage | No | Yes / Dedicated |
Kernel space CPU usage | No | Yes / Dedicated |
Other processes CPU usage | No | Yes / Dedicated |
Optimal processes CPU usage | No | Yes / Dedicated |
Time the CPU is waiting for I/O operations | No | Yes / Dedicated |
CPU interrupt time | No | Yes / Dedicated |
Software CPU interrupt time | No | Yes / Dedicated |
Available memory | No | Yes / Dedicated |
Idle memory | No | Yes / Dedicated |
Buffer | No | Yes / Dedicated |
Cache | No | Yes / Dedicated |
Input bandwidth per NIC | No | Yes / Dedicated |
Output bandwidth per NIC | No | Yes / Dedicated |
Packet rate sent per NIC | No | Yes / Dedicated |
Packet rate received per NIC | No | Yes / Dedicated |
Error packet rate received per NIC | No | Yes / Dedicated |
Error packet rate transmitted per NIC | No | Yes / Dedicated |
Received packets dropped rate per NIC | No | Yes / Dedicated |
Transmitted packets dropped rate per NIC | No | Yes / Dedicated |
Running processes | No | Yes / Dedicated |
Idle processes | No | Yes / Dedicated |
Zombie processes | No | Yes / Dedicated |
Blocked processes | No | Yes / Dedicated |
Sleeping processes | No | Yes / Dedicated |
Total processes | No | Yes / Dedicated |
TCP retransmission rate | No | Yes / Dedicated |
TCP SYS_SENT | No | Yes / Dedicated |
TCP SYS_RECV | No | Yes / Dedicated |
TCP FIN_WAIT1 | No | Yes / Dedicated |
TCP FIN_WAIT2 | No | Yes / Dedicated |
TCP CLOSE | No | Yes / Dedicated |
TCP LAST_ACK | No | Yes / Dedicated |
TCP LISTEN | No | Yes / Dedicated |
TCP CLOSING | No | Yes / Dedicated |
Average CPU load in the last minute | No | Yes / Dedicated |
Average CPU load in the last 15 minutes | No | Yes / Dedicated |
Average CPU load in the last 5 minutes | No | Yes / Dedicated |
TCP ESTABLISHED | No | Yes / Dedicated |
TCP TOTAL | No | Yes / Dedicated |
UDP TOTAL | No | Yes / Dedicated |
NTP Offset | No | Yes / Dedicated |
Total files processed | No | Yes / Dedicated |
VPN Gateway Monitoring Metrics
Metrics | Supported |
---|---|
Ingress Packet Rate | Yes |
Egress Packet Rate | Yes |
Ingress Bandwidth | Yes |
Egress Bandwidth | Yes |
Ingress Bandwidth Usage | Yes |
Number of Connections | Yes |
Egress Bandwidth Usage | Yes |
VPN Connection Monitoring Metrics
Metrics | Supported |
---|---|
Tunnel Average RTT | Yes |
Tunnel Max RTT | Yes |
Tunnel Packet Loss Rate | Yes |
Link Average RTT | Yes |
Link Max RTT | Yes |
Link Packet Loss Rate | Yes |
VPN Connection Status | Yes |
Packet Receive Rate | Yes |
Packet Send Rate | Yes |
Traffic Receive Rate | Yes |
Traffic Send Rate | Yes |
SA packet sending rate | Yes |
SA packet receiving rate | Yes |
SA traffic sending rate | Yes |
SA traffic receiving rate | Yes |
NAT monitoring metrics
Metrics | Supported |
---|---|
SNAT connections | Yes |
Inbound bandwidth | Yes |
Outbound bandwidth | Yes |
Inbound PPS | Yes |
Outbound PPS | Yes |
Inbound traffic | Yes |
Outbound traffic | Yes |
SNAT connections usage rate | Yes |
Inbound bandwidth usage rate | Yes |
Outbound bandwidth usage rate | Yes |
Total outbound bandwidth (UDP) | Yes |
Total outbound bandwidth (TCP) | Yes |
Total inbound bandwidth (UDP) | Yes |
Total inbound bandwidth (TCP) | Yes |
Packets lost due to excessive SNAT connections | Yes |
Packets lost due to excessive PPS | Yes |
Packets lost by all allocated EIP ports | Yes |
Events monitored for CBR alarm
Events | Supported |
---|---|
Agent online | Yes |
Agent offline | Yes |
Failed to create backup | Yes |
Failed to restore resource from backup | Yes |
Failed to delete backup | Yes |
Failed to delete vault | Yes |
Backup was successful | Yes |
Restore resource from backup was successful | Yes |
Backup was deleted successfully | Yes |
Vault was deleted successfully | Yes |
Error during replication | Yes |
Replication was successful | Yes |
Events monitored for server alarms
Events | Supported |
---|---|
Redeployment scheduled to be authorized | Yes |
Local disk swap canceled | Yes |
Local disk swap to be executed | Yes |
Xid event alarm triggered on GPU | Yes |
Spec modification scheduled to be executed | Yes |
Migration scheduled to be executed | Yes |
Shutdown scheduled to be executed | Yes |
Reboot scheduled to be executed | Yes |
Redeployment scheduled to be executed | Yes |
Unrecoverable ECC errors generated by GPU SRAM | Yes |
Inforom alarm generated on GPU | Yes |
ECC double bit alarm generated on GPU | Yes |
Excessive retired pages | Yes |
ECC alarm generated on GPU a100 | Yes |
ECC failure on GPU memory page retirement | Yes |
ECC failure on GPU page retirement | Yes |
Too many single bit ECC errors on GPU | Yes |
Video card not found | Yes |
Faulty GPU link | Yes |
Video card lost | Yes |
Faulty GPU memory page | Yes |
Faulty GPU engine image | Yes |
GPU temperature too high | Yes |
Faulty GPU NVLink | Yes |
nvidia-smi hang | Yes |
ECS cleared | Yes |
ECS restarted | Yes |
ECS shut down | Yes |
NIC deleted | Yes |
ECS resized | Yes |
Hardware error reboot | Yes |
Hardware error reboot successful | Yes |
Auto-recovery timeout | Yes |
Initialization error | Yes |
GPU link error | Yes |
FPGA link error | Yes |
ECS error due to abnormal processes on host | Yes |
GuestOS restarted | Yes |
Migration started | Yes |
Migration completed successfully | Yes |
Error during migration | Yes |
Risk of host crash | Yes |
Unrecoverable ECC errors: NPU | Yes |
References
- CES documentation: https://support.huaweicloud.com/intl/en-us/function-ces/index.html
- CES limitations: https://support.huaweicloud.com/intl/en-us/productdesc-ces/ces_07_0007.html
- FAQ: https://support.huaweicloud.com/intl/en-us/ces_faq/ces_faq_0059.html
- CES agent batch installation: https://support.huaweicloud.com/intl/en-us/usermanual-ces/ces_01_0033.html