Metrics to Watch for AWS EBS Volume Health

After you have migrated to the cloud, much of the hard work is done, but that doesn’t mean that you can begin ignoring your set-up. To get the most value out of your volumes in the Amazon Elastic Block Store (EBS), you must maintain awareness of the operating conditions of your system. Monitoring, which can be done using CloudWatch metrics, allows you to ensure that you are running healthy applications and using optimal instance and volume configurations.

CloudWatch collects predefined metrics for you at five-minute intervals for General Purpose SSD, Throughput Optimized HDD, and Cold HDD volumes, and at one-minute intervals for Provisioned IOPS volumes. When you request this data for viewing, you must include a Period parameter giving the length, in seconds, of each interval you want the data aggregated over, and this period will be used to calculate the values returned.
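As a rough sketch of what such a request looks like, the snippet below builds the parameters for a CloudWatch GetMetricStatistics call as a plain dictionary. The helper name and the volume ID are made up for illustration, and nothing is sent to AWS; it only shows where the Period fits alongside the metric name, the volume dimension, and the time window.

```python
from datetime import datetime, timedelta, timezone


def build_metric_request(volume_id, metric_name, period_seconds,
                         statistic, end_time, lookback_minutes):
    """Return keyword arguments shaped like a GetMetricStatistics request."""
    return {
        "Namespace": "AWS/EBS",
        "MetricName": metric_name,
        "Dimensions": [{"Name": "VolumeId", "Value": volume_id}],
        "StartTime": end_time - timedelta(minutes=lookback_minutes),
        "EndTime": end_time,
        "Period": period_seconds,   # length of each aggregation interval, in seconds
        "Statistics": [statistic],  # e.g. "Sum" or "Average"
    }


# Hypothetical volume ID and time window, for illustration only.
request = build_metric_request(
    volume_id="vol-0123456789abcdef0",
    metric_name="VolumeReadOps",
    period_seconds=300,
    statistic="Sum",
    end_time=datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc),
    lookback_minutes=60,
)

# With boto3 installed and credentials configured, the dictionary could be
# passed along as: boto3.client("cloudwatch").get_metric_statistics(**request)
print(request["MetricName"], request["Period"])
```

A one-hour window with a 300-second period would return up to twelve datapoints, one per five-minute interval.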

You can monitor the metrics of your EBS volumes provided by CloudWatch through the AWS web console, the AWS CLI, or via third-party tools that integrate with the CloudWatch API.

Key Metrics for AWS EBS Monitoring

There are four categories of metrics that you should monitor to ensure the health of your volumes: disk I/O, latency, disk activity, and status. You need to monitor disk space as well, so you know how much of your storage you’re using and on which volumes. Unfortunately, this cannot be done through CloudWatch's predefined EBS metrics and must be checked either manually or through another means, such as a third-party tool.

Disk I/O

The two types of metrics related to disk I/O are throughput and I/O operations per second (IOPS). When examined together, these metrics will indicate if you need to adjust the size of your AWS EBS volumes or switch between SSD and HDD volume types. They also indicate if you could benefit from a caching layer, if you’re seeing long queues, or from a load balancer, if you’re seeing a lot of activity on a few volumes but none on others.

VolumeReadBytes and VolumeWriteBytes

These metrics give you the amount of data, in bytes, being read from or written to your volumes. You can choose to view either the average, which reports the average size of each operation during the period, or the sum, which gives the total number of bytes transferred during the period. Dividing the sum by the number of seconds in the period gives the overall throughput.
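To make the throughput arithmetic concrete, here is a minimal sketch that turns a Sum value into bytes per second. The function name and the sample numbers are invented for illustration; the calculation simply divides the total bytes by the length of the period.

```python
MIB = 1024 * 1024


def throughput_bytes_per_second(sum_bytes, period_seconds):
    """Average throughput over one period, from the Sum statistic."""
    if period_seconds <= 0:
        raise ValueError("period_seconds must be positive")
    return sum_bytes / period_seconds


# Made-up sample: 3,145,728,000 bytes read during a 300-second period.
read_sum = 3_145_728_000
period = 300
read_mib_per_s = throughput_bytes_per_second(read_sum, period) / MIB
print(f"{read_mib_per_s:.1f} MiB/s")  # prints "10.0 MiB/s"
```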

VolumeReadOps and VolumeWriteOps

These values will give you a count of read and write operations performed during the period you choose to report on. They do not tell you the size of the operations. Your IOPS value is found by dividing the total number of operations by the number of seconds in your chosen period.
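The IOPS calculation described here can be sketched in a few lines. The function name and sample counts are illustrative assumptions; the inputs stand for the Sum of VolumeReadOps and VolumeWriteOps over the same period.

```python
def average_iops(read_ops, write_ops, period_seconds):
    """Average IOPS over a period: total operations divided by seconds."""
    if period_seconds <= 0:
        raise ValueError("period_seconds must be positive")
    return (read_ops + write_ops) / period_seconds


# Made-up sample: 90,000 reads and 60,000 writes in a five-minute period.
iops = average_iops(read_ops=90_000, write_ops=60_000, period_seconds=300)
print(iops)  # prints 500.0
```

Because this is an average across the whole period, short spikes inside the period are smoothed out; a shorter period gives a finer-grained picture.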

VolumeThroughputPercentage

This metric applies only to Provisioned IOPS SSD volumes and gives the percentage of provisioned IOPS used during the period. Keep in mind that if there is only one operation being performed, the metric will report 100%, and that if actions such as snapshot creation or initial data access occurred, you should expect to see degraded performance.

VolumeConsumedReadWriteOps

This metric applies only to Provisioned IOPS SSD volumes and gives the total number of operations, normalized to 256K units, used within a period. Remember when provisioning that any operation that is 256K or smaller counts as one IOPS use, while larger operations are counted in 256K units. This can make provisioned volumes a poor choice if you are performing many operations much smaller than 256K.
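The normalization can be illustrated with a short sketch. It assumes that "256K" means 256 KiB and that operations larger than that are rounded up to whole 256K units; the function name is made up for this example.

```python
UNIT_BYTES = 256 * 1024  # assumed size of one normalized unit (256 KiB)


def consumed_units(operation_bytes):
    """Number of 256K units a single I/O operation counts as."""
    if operation_bytes <= 0:
        raise ValueError("operation_bytes must be positive")
    # Integer ceiling division; anything up to 256K counts as one unit.
    return max(1, (operation_bytes + UNIT_BYTES - 1) // UNIT_BYTES)


for size_kib in (16, 256, 1024):
    print(size_kib, "KiB ->", consumed_units(size_kib * 1024))
# 16 KiB -> 1
# 256 KiB -> 1
# 1024 KiB -> 4
```

A 16 KiB write uses the same single unit as a 256 KiB write, which is why a workload made of many small operations gets less data moved per provisioned IOPS.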

Disk Activity

Monitoring metrics on disk activity will help you optimize volume usage and ensure that you are not paying for resources that aren’t being used. It can also help you decide, in conjunction with disk I/O and latency metrics, if the type of volumes you’re using and the limits you have set on them are correct for your purposes.

VolumeIdleTime

This metric gives the total number of seconds during a period in which no operations occurred. In general, idle time means wasted resources, but a sudden increase in idle time can indicate that operation requests aren’t being sent to your volumes, as in the case of application errors.
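Idle time is easier to compare across periods of different lengths when expressed as a percentage. The sketch below does that conversion; the function name and the sample values are assumptions for illustration.

```python
def idle_percentage(idle_seconds, period_seconds):
    """Share of the period, in percent, during which the volume was idle."""
    if period_seconds <= 0:
        raise ValueError("period_seconds must be positive")
    return 100 * idle_seconds / period_seconds


# Made-up sample: 270 idle seconds reported for a 300-second period.
print(idle_percentage(270, 300))  # prints 90.0
```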

VolumeQueueLength

This gives the number of operation requests waiting to be completed over a given period. If you are frequently seeing a queue length of zero, particularly with notable idle times, you are not using your resources efficiently. It is recommended to aim for a queue length of one for every 500 IOPS available on SSD volumes, and of at least four when performing 1 MiB sequential operations on HDD volumes.
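As a quick illustration of the SSD guideline, the sketch below converts a volume's available IOPS into a target queue length using the one-per-500-IOPS ratio given above. The function name and the 2,000 IOPS sample are made up; treat the result as a starting point for tuning rather than a hard rule.

```python
IOPS_PER_QUEUE_SLOT = 500  # ratio taken from the guideline in the text


def target_queue_length_ssd(available_iops):
    """Suggested queue length for an SSD volume with the given IOPS."""
    if available_iops <= 0:
        raise ValueError("available_iops must be positive")
    return available_iops / IOPS_PER_QUEUE_SLOT


# Hypothetical volume with 2,000 IOPS available.
print(target_queue_length_ssd(2000))  # prints 4.0
```

Comparing this target with the VolumeQueueLength values you actually observe shows whether a volume is being driven too lightly or too hard.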

BurstBalance

This metric only applies to General Purpose SSD (gp2), Throughput Optimized HDD, and Cold HDD volumes and gives the percentage of I/O or throughput credits remaining in your burst bucket. Remember that data is only reported when a volume is active and that General Purpose SSD volumes of 1 TiB or larger will never run out of I/O credits, because their baseline performance already matches the burst limit. If you are consistently running out of burst credits, you should probably switch to higher capacity volumes or io1 volumes.

Latency

Latency is the amount of time between when an operation request is sent and when it’s completed. In combination with disk activity metrics, these metrics can help you identify performance issues and let you know if you need to set higher IOPS limits on your volumes.

VolumeTotalReadTime and VolumeTotalWriteTime

These metrics tell you the total number of seconds spent by all operations during the specified period. Each operation time is counted regardless of whether it occurred simultaneously with another operation, meaning that the total time could be greater than that of the period.
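Total time becomes more useful once it is divided by the number of operations, which gives an average latency per operation. The sketch below pairs a VolumeTotalReadTime sum with the VolumeReadOps sum for the same period; the function name and sample values are illustrative assumptions.

```python
def average_latency_ms(total_time_seconds, operation_count):
    """Average time per operation in milliseconds, or None if no operations ran."""
    if operation_count == 0:
        return None  # an idle period has no meaningful latency
    return 1000 * total_time_seconds / operation_count


# Made-up sample: 45 seconds of total read time across 90,000 read operations.
print(average_latency_ms(45, 90_000))  # prints 0.5
```

The same calculation applies to writes, using VolumeTotalWriteTime and VolumeWriteOps.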

Status Checks and Events

Although they are not truly metrics, the status checks of your volumes should be reviewed and used in combination with metrics to understand the overall health of your EBS set-up. AWS automatically runs a number of checks that can alert you to issues with your stored data or volume performance. You cannot modify these checks but you can customize alerts triggered by the results of checks.

Volume status

Volume status is reported as ok when all checks pass, impaired when any checks fail, or insufficient-data when checks are incomplete. These checks are performed every five minutes. If data inconsistencies are detected, AWS will automatically block your affected volumes from performing operations, flag them as impaired, and generate an event to notify you.

I/O performance status

I/O performance status checks only apply to provisioned volumes and are performed every minute. They indicate the difference between actual performance and expected performance of a volume. These checks will return values of ok, warning with an indication of either degraded or severely degraded performance, impaired with an indication of stalled or not available, or insufficient-data.

Events

When events are created, they include a start time and a duration indicating how long a volume was disabled. Once a volume is re-enabled, the end time will be added to the event. Events include a description informing you if a volume is waiting to be enabled, has been enabled, or what the specifics of the returned check value were.

Wrap Up

In order to really know the health of your EBS set-up and ensure that you are using your resources efficiently, you should be monitoring at minimum the discussed metrics. These can tell you what your current performance is and give you an indication of how you might need to adapt your configuration in the future as your business expands and your needs change.

AWS built-in services can help you monitor many of these indicators, but they have limitations and may not be the best option for your purposes, so you might want to consider custom or third-party options to make the most of your data.

