Alerts from baseline monitoring in AMS

CW = CloudWatch. ARN = Amazon Resource Name. * = wildcard (any).

Amazon EC2 Simple Storage Service (S3)

Get Object In S3. Effect: Allow. Actions: Get, List. Allows EC2 applications to retrieve and list objects in S3 buckets in your account.

Customer Encrypted Log S3 Access. Effect: Allow. Actions: PutObject. Allows EC2 applications to update objects in aws:s3:::mc-*-logs-*/encrypted/app/*

Patch Data Put Object S3. Effect: Allow. Actions: PutObject. Allows EC2 applications to upload patching data to your S3 buckets at aws:s3:::awsms-a*-patch-data-*

Uploading Own Logs To S3. Effect: Allow. Actions: PutObject. Allows EC2 applications to upload custom logs to: aws:s3:::mc-a*-logs-*/aws/instances/*/${aws:userid}/*

Explicitly Deny MC Namespace S3 Logs. Effect: Deny. Actions: GetObject*, Put*. Disallows EC2 applications getting or putting any objects from or to: aws:s3:::mc-*-logs-*/

Delete. Effect: Deny. Actions: * (all). Disallows EC2 applications taking any action on objects in: aws:s3:::mc-a*-logs-*/*, aws:s3:::mc-a*-internal-*/*,

Explicitly Deny S3 CFN Bucket. Effect: Deny. Actions: Delete*. Disallows EC2 applications deleting any objects from: aws:s3:::cf-templates-*

Explicitly Deny List Bucket S3. Effect: Deny. Actions: ListBucket. Disallows you listing any encrypted, audit log, or reserved (mc) objects from: aws:s3:::mc-*-logs-*

If you're unfamiliar with Amazon IAM policies, see Overview of IAM Policies for important information.

Note

Policies often include multiple statements, where each statement grants permissions to a different set of resources or grants permissions under a specific condition.
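As a rough illustration of a multi-statement policy, the sketch below builds a two-statement policy document modeled on two rows of the table above and evaluates requests against it. The `Sid` values, the helper names, and the full `arn:aws:s3:::` prefix (the table abbreviates it as `aws:s3:::`) are assumptions made for the example. The evaluation is deliberately simplified: it only shows that an explicit Deny overrides an Allow and that anything not allowed is denied, and it ignores conditions and policy variables.

```python
from fnmatch import fnmatchcase

# Hypothetical two-statement policy modeled on rows of the table above.
# Assumption: resources are written with the full "arn:aws:s3:::" prefix.
POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "PatchDataPutObjectS3",  # illustrative name
            "Effect": "Allow",
            "Action": ["s3:PutObject"],
            "Resource": ["arn:aws:s3:::awsms-a*-patch-data-*"],
        },
        {
            "Sid": "ExplicitlyDenyS3CFNBucket",  # illustrative name
            "Effect": "Deny",
            "Action": ["s3:Delete*"],
            "Resource": ["arn:aws:s3:::cf-templates-*"],
        },
    ],
}


def _matches(patterns, value):
    """Return True if value matches any wildcard pattern ('*' matches any run of characters)."""
    return any(fnmatchcase(value, pattern) for pattern in patterns)


def is_allowed(policy, action, resource):
    """Simplified evaluation: an explicit Deny wins; otherwise a matching Allow is required."""
    allowed = False
    for statement in policy["Statement"]:
        if _matches(statement["Action"], action) and _matches(statement["Resource"], resource):
            if statement["Effect"] == "Deny":
                return False
            allowed = True
    return allowed


print(is_allowed(POLICY, "s3:PutObject", "arn:aws:s3:::awsms-a123-patch-data-us-east-1/report.json"))
print(is_allowed(POLICY, "s3:DeleteObject", "arn:aws:s3:::cf-templates-abc/template.yaml"))
```

Each statement is matched independently, which is why a single policy can both grant uploads to one bucket pattern and block deletions in another.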

This section describes AMS monitoring defaults; for more information, see Monitoring and event management (p. 300).

The following table shows what is monitored and the default alerting thresholds. To change the alerting thresholds, first determine the changes you want and subscribe to the relevant CloudWatch Amazon SNS topic, then submit a Management | Other | Other | Update (ct-0xdawir96cy7k) RFC. For information about creating and subscribing to topics, see Subscribe to a Topic. For general information, see Amazon SNS FAQs. To be notified directly when alarms cross their threshold, in addition to AMS's standard alerting process, follow the instructions for overwriting alarm configurations in Receiving alerts generated by AMS (p. 305).

Amazon CloudWatch provides extended retention of metrics. For more information, see CloudWatch Limits.

Note

AMS calibrates its baseline monitoring on a periodic basis. New accounts are always onboarded with the latest baseline monitoring, and the table describes the baseline monitoring for a newly onboarded account. AMS updates the baseline monitoring in existing accounts on a periodic basis, so you may experience a time lag before the updates are in place. For more information, see Viewing the monitoring configuration for an account (p. 303).
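Most trigger conditions in the table that follows share one shape: a metric statistic is compared with a threshold over a fixed period, and the alert fires after a number of consecutive breaching periods. The sketch below models that reading in plain Python. The function name and the list-of-datapoints input are assumptions for illustration, and the logic is a simplification of how CloudWatch actually evaluates alarms.

```python
import operator

# Map the comparison symbols used in the table to Python operators.
COMPARATORS = {
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
}


def alarm_triggered(datapoints, comparator, threshold, consecutive):
    """Return True once `consecutive` datapoints in a row breach the threshold.

    Each datapoint is the metric statistic for one period, for example the
    average CPUUtilization over one 5-minute period.
    """
    breaches = COMPARATORS[comparator]
    streak = 0
    for value in datapoints:
        # A breaching datapoint extends the streak; any other value resets it.
        streak = streak + 1 if breaches(value, threshold) else 0
        if streak >= consecutive:
            return True
    return False


# "> 85% for 5 mins, 2 consecutive times": one datapoint per 5-minute period.
print(alarm_triggered([80, 90, 84, 88, 91], ">", 85, 2))
print(alarm_triggered([90, 80, 90, 80], ">", 85, 2))
```

In the first call the last two datapoints (88 and 91) both exceed 85, so the condition is met; in the second call no two breaching datapoints are adjacent.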

Alerts from baseline monitoring

Columns: Resource, Security alert, Alert name and trigger condition, Notes.

For starred (*) alerts, AMS proactively assesses impact and remediates when possible; if remediation is not possible, AMS creates an incident. Where automation fails to remediate the issue, AMS informs you of the incident case and an AMS engineer is engaged. In addition, these alerts can be sent directly to your email (if you have opted in to the Direct-Customer-Alerts SNS topic).

Resource: Application Load Balancer (ALB) instance. Security alert: No.

RejectedConnectionCount

sum > 0 for 1 min, 5 consecutive times.

CloudWatch alarm on the number of connections that were rejected because the load balancer reached its maximum.

Resource: Application Load Balancer (ALB) target. Security alert: No.

TargetConnectionErrorCount

sum > 0 for 1 min, 5 consecutive times.

CloudWatch alarm on the number of connections that were not successfully established between the load balancer and the registered instances.

Resource: Aurora instance. Security alert: No.

CPUUtilization

> 85% for 5 mins, 2 consecutive times.

CloudWatch alarm.

Resource: EC2 instance - all OSs. Security alert: No.

CPUUtilization*

>= 95% for 5 mins, 6 consecutive times.

CloudWatch alarm. High CPU utilization is an indicator of a change in application state such as deadlocks, infinite loops, malicious attacks, and other anomalies.

StatusCheckFailed

> 0 for 5 minutes, 3 consecutive times.

Root Volume Usage

>= 95% for 5 mins, 6 consecutive times.

Memory Free*

MemoryFree < 5% for 5 minutes, 6 consecutive times.

CloudWatch alarm.

Resource: EC2 instance - all OSs. Security alert: Yes.

EPS Malware

Malware found on instance.

CloudWatch event.

Resource: EC2 instance (Linux only).

Root Volume Inode Usage

Average >= 95% for 5 mins, 6 consecutive times.

Memory Swap < 5% for 5 minutes, 6 consecutive times.

CloudWatch alarm. Applied to Linux instances only.

Resource: ElastiCache Cluster. Security alert: No.

CurrConnections

= 65000

This alarm notifies AMS of the maximum connection limit of an ElastiCache host.

CloudWatch alarm. If you would like to update this threshold, contact AMS support.

Resource: ElastiCache Node. Security alert: No.

CPUUtilization

Average > predefined value for 15 mins, 2 consecutive times.

CloudWatch alarm. Default is 90. If Redis, use one of the following values based on instance type:

• cache.t1.micro: 90%

Resource: ElastiCache Node - memcached. Security alert: No.

SwapUsage

maximum > 50,000,000 bytes for 5 mins, 5 consecutive times.

CloudWatch alarm. Applied to memcached only.

Resource: OpenSearch cluster. Security alert: No.

ClusterStatus.red

maximum is >= 1 for 1 minute, 1 consecutive time.

AMS takes proactive actions to reduce operational impact when this alert is triggered.

CloudWatch alarm. At least one primary shard and its replicas are not allocated to a node. To learn more, see Red Cluster Status.

KMSKeyError

>= 1 for 1 minute, 1 consecutive time.

CloudWatch alarm. The KMS encryption key that is used to encrypt data at rest in your domain is disabled. Re-enable it to restore normal operations. To learn more, see Encryption of Data at Rest for OpenSearch Service.

ClusterStatus.yellow

maximum is >= 1 for 1 minute, 1 consecutive time.

AMS takes proactive actions to reduce operational impact when this alert is triggered.

At least one replica shard is not allocated to a node. To learn more, see Yellow Cluster Status.

FreeStorageSpace

minimum is <= 20480 for 1 minute, 1 consecutive time.

AMS takes proactive actions to reduce operational impact when this alert is triggered.

A node in your cluster is down to 20 GiB of free storage space. To learn more, see Lack of Available Storage Space.

ClusterIndexWritesBlocked

>= 1 for 5 minutes, 1 consecutive time.

AMS takes proactive actions to reduce operational impact when this alert is triggered.

The cluster is blocking write requests. To learn more, see ClusterBlockException.

AMS takes proactive actions to reduce operational impact when this alert is triggered.

x is the number of nodes in your cluster. This alarm indicates that at least one node in your cluster has been unreachable for one day.

To learn more, see Failed Cluster Nodes.

CPUUtilization

average is >= 80% for 15 minutes, 3 consecutive times

AMS takes proactive actions to reduce operational impact when this alert is triggered.

100% CPU utilization is common, but sustained high averages are problematic. Consider using larger instance types or adding instances.

JVMMemoryPressure

maximum is >= 80% for 5 minutes, 3 consecutive times

AMS takes proactive actions to reduce operational impact when this alert is triggered.

The cluster could encounter out-of-memory errors if usage increases.

Consider scaling vertically. Amazon ES uses half of an instance's RAM for the Java heap, up to a heap size of 32 GiB. You can scale instances vertically up to 64 GiB of RAM, at which point you can scale horizontally by adding instances.

MasterCPUUtilization

average is >= 50% for 15 minutes, 3 consecutive times

AMS takes proactive actions to reduce operational impact when this alert is triggered.

Consider using larger instance types for your dedicated master nodes. Because of their role in cluster stability and blue/green deployments, dedicated master nodes should have lower average CPU usage than data nodes.

MasterJVMMemoryPressure

maximum is >= 80% for 15 minutes, 1 consecutive time

AMS takes proactive actions to reduce operational impact when this alert is triggered.

Resource: OpenSearch instance. Security alert: No.

AutomatedSnapshotFailure

maximum is >= 1 for 1 minute, 1 consecutive time.

CloudWatch alarm. An automated snapshot failed. This failure is often the result of a red cluster health status. See Red Cluster Status.
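The JVMMemoryPressure note above says that half of an instance's RAM goes to the Java heap, up to a heap size of 32 GiB. The short sketch below works through that arithmetic. The function names are hypothetical, and the second helper assumes that JVMMemoryPressure expresses the percentage of the Java heap in use.

```python
MAX_HEAP_GIB = 32  # heap cap stated in the JVMMemoryPressure note


def jvm_heap_gib(instance_ram_gib):
    """Heap size: half of the instance RAM, capped at 32 GiB."""
    return min(instance_ram_gib / 2, MAX_HEAP_GIB)


def heap_used_at_pressure_gib(instance_ram_gib, pressure_percent=80):
    """Heap in use when memory pressure reaches the given percentage.

    Assumption: JVMMemoryPressure is the percentage of the Java heap in use.
    """
    return jvm_heap_gib(instance_ram_gib) * pressure_percent / 100


for ram in (16, 64, 128):
    print(ram, "GiB RAM ->", jvm_heap_gib(ram), "GiB heap,",
          heap_used_at_pressure_gib(ram), "GiB in use at the 80% threshold")
```

Under this rule the heap stops growing once an instance has 64 GiB of RAM, which matches the note's advice to scale horizontally beyond that point.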

Resource: Elastic Load Balancing instance. Security alert: No.

SurgeQueueLength

> 100 for 1 minute, 15 consecutive times.

CloudWatch alarm if an excess number of requests are pending routing.

HTTPCode_ELB_5XX_Count

sum > 0 for 5 min, 3 consecutive times.

CloudWatch alarm on an excess number of HTTP 5XX response codes that originate from the load balancer.

SpilloverCount

> 1 for 1 minute, 15 consecutive times.

CloudWatch alarm if an excess number of requests were rejected because the surge queue is full.

Resource: GuardDuty service. Security alert: Yes.

Not applicable; all findings (threat purposes) are monitored. Each finding corresponds to an alert.

Changes in the GuardDuty findings. These changes include newly generated findings or subsequent occurrences of existing findings.

For a list of supported GuardDuty finding types, see GuardDuty Active Finding Types.

Resource: Health. Security alert: Varies.

AWS Personal Health Dashboard. Notifications sent when there are changes in the status of AWS Personal Health Dashboard (AWS Health) events.

Service event example: Scheduled EC2 instance store retirement.

These Health events are not monitored:

Resource: AWS Managed Microsoft AD. Security alert: No.

AWS Managed Microsoft AD instance sends an active status event.

Service event. Emitted when the directory is operating normally after an event.

Impaired Directory Status

AWS Managed Microsoft AD instance sends an impaired directory status event.

Service event. Emitted when the directory is running in a degraded state. One or more issues have been detected, and not all directory operations may be working at full operational capacity.

Inoperable Directory Status

AWS Managed Microsoft AD instance sends an inoperable status event.

Service event. Emitted when the directory is not functional. All directory endpoints have reported issues.

Deleting Directory Status

AWS Managed Microsoft AD instance sends a deleting directory status event.

Service event. Emitted when the directory is currently being deleted.

Failed Directory Status

AWS Managed Microsoft AD instance sends a failed status event.

Service event. Emitted when the directory could not be created.

RestoreFailed Directory Status

AWS Managed Microsoft AD instance sends a restore failed directory status event.

Service event. Emitted when restoring the directory from a snapshot failed.

Resource: Amazon RDS instance. Security alert: No.

Failover not attempted

Amazon RDS is not attempting a requested failover because a failover recently occurred on the DB instance.

Service event. RDS-EVENT-0034; see Amazon RDS Event Categories and Event Messages.

DB instance partial failover recovery complete

The instance has recovered from a partial failover.

Service event. RDS-EVENT-0065; see Amazon RDS Event Categories and Event Messages.

DB instance fail

The DB instance has failed due to an incompatible configuration or an underlying storage issue. Begin a point-in-time-restore for the DB instance.

Service event. RDS-EVENT-0031; see Amazon RDS Event Categories and Event Messages.

Invalid subnet IDs DB instance

The DB instance is in an incompatible network. Some of the specified subnet IDs are invalid or do not exist.

Service event. RDS-EVENT-0036; see Amazon RDS Event Categories and Event Messages.

DB instance invalid parameters

For example, MySQL could not start because a memory-related parameter is set too high for this instance class, so the customer action would be to modify the memory parameter and reboot the DB instance.

Service event. RDS-EVENT-0035; see Amazon RDS Event Categories and Event Messages.

Error create statspack user account

Error while creating Statspack user account PERFSTAT. Drop the account before adding the Statspack option.

Service event. RDS-EVENT-0058; see Amazon RDS Event Categories and Event Messages.

DB instance without enhanced monitoring

Enhanced Monitoring can't be enabled without the enhanced monitoring IAM role. For information about creating the enhanced monitoring IAM role, see To create an IAM role for Amazon RDS Enhanced Monitoring.

Service event. RDS-EVENT-0079; see Amazon RDS Event Categories and Event Messages.

DB instance enhanced monitoring disabled

Enhanced Monitoring was disabled due to an error making the configuration change. It's likely that the enhanced monitoring IAM role is configured incorrectly. For information about creating the enhanced monitoring IAM role, see To create an IAM role for Amazon RDS Enhanced Monitoring.

Service event. RDS-EVENT-0080; see Amazon RDS Event Categories and Event Messages.

Invalid permissions recovery S3 bucket

The IAM role that you use to access your Amazon S3 bucket for SQL Server native backup and restore is configured incorrectly. For more information, see Setting Up for Native Backup and Restore.

Service event. RDS-EVENT-0081; see Amazon RDS Event Categories and Event Messages.

DB instance read replica error

An error has occurred in the read replication process. For more information, see the event message. For information on troubleshooting Read Replica errors, see Troubleshooting a MySQL Read Replica Problem.

Service event. RDS-EVENT-0045; see Amazon RDS Event Categories and Event Messages.

DB instance read replication ended

Replication on the Read Replica was ended.

Service event. RDS-EVENT-0057; see Amazon RDS Event Categories and Event Messages.

DB instance recovery start

The SQL Server DB instance is re-establishing its mirror. Performance will be degraded until the mirror is reestablished. A database was found with non-FULL recovery model. The recovery model was changed back to FULL and mirroring recovery was started. (<dbname>: <recovery model found>[,…])

Service event. RDS-EVENT-0066; see Amazon RDS Event Categories and Event Messages.

Low storage alert triggers when the allocated storage for the DB instance has been exhausted.

RDS-EVENT-0007; see details at Using Amazon RDS event notification.

Low storage alert when the DB instance has consumed more than 90% of its allocated storage.

RDS-EVENT-0089; see details at Amazon RDS Event Categories and Event Messages.

Notification service when scaling failed for the Aurora Serverless DB cluster.

RDS-EVENT-0143; see details at Amazon RDS Event Categories and Event Messages.

CPUUtilization

Average CPU utilization > 75% for 15 mins, 2 consecutive times.

DiskQueueDepth

Sum is > 75 for 1 min, 15 consecutive times.

FreeStorageSpace

Average < 1,073,741,824 bytes for 5 mins, 2 consecutive times.

ReadLatency

Average >= 1.001 seconds for 5 mins, 2 consecutive times.

WriteLatency

Average >= 1.005 seconds for 5 mins, 2 consecutive times.

SwapUsage

Average >= 104,857,600 bytes for 5 mins, 2 consecutive times.

CloudWatch alarm.
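Several thresholds above are given as raw byte counts, which are easier to reason about in binary units. The sketch below converts them; the helper name is an assumption for illustration and is not part of any AWS tooling.

```python
def bytes_to_binary_units(n_bytes):
    """Convert a byte count to the largest binary unit that keeps the value >= 1."""
    units = ["bytes", "KiB", "MiB", "GiB", "TiB"]
    value = float(n_bytes)
    for unit in units:
        # Stop once the value fits below 1024 or no larger unit is left.
        if value < 1024 or unit == units[-1]:
            return value, unit
        value /= 1024


# FreeStorageSpace threshold: 1,073,741,824 bytes
print(bytes_to_binary_units(1_073_741_824))  # (1.0, 'GiB')

# SwapUsage threshold: 104,857,600 bytes
print(bytes_to_binary_units(104_857_600))  # (100.0, 'MiB')
```

Read this way, the FreeStorageSpace alert fires when average free storage falls below 1 GiB, and the SwapUsage alert fires when average swap usage reaches 100 MiB.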

Resource: Amazon Redshift cluster. Security alert: No.

RedshiftClusterStatus

The health of the cluster when not in maintenance mode < 1 for 5 min.

1 represents a healthy cluster.

Resource: Amazon Macie. Security alert: Yes.

Newly generated alerts and updates to existing alerts.

Macie finds any changes in the findings. These changes include newly generated findings or subsequent occurrences of existing findings.

Amazon Macie alert. For a list of supported Macie alert types, see Analyzing Amazon Macie Findings.

Note that Macie is not enabled for all accounts.

AMS takes proactive actions (scaling the cluster) when this alert is triggered.

For information on remediation efforts, see AMS automatic remediation of alerts (p. 305).
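For readers who want to relate a trigger condition from the table to CloudWatch alarm settings, the sketch below translates one condition into the parameter names used by the CloudWatch PutMetricAlarm API (for example, as keyword arguments to `put_metric_alarm` in boto3). It only builds a dictionary and does not call AWS. The helper name and the alarm name are hypothetical, and this is not a description of how AMS itself provisions its alarms; threshold changes in AMS accounts go through the RFC described earlier.

```python
# Comparison symbols from the table mapped to CloudWatch ComparisonOperator values.
COMPARISON_OPERATORS = {
    ">": "GreaterThanThreshold",
    ">=": "GreaterThanOrEqualToThreshold",
    "<": "LessThanThreshold",
    "<=": "LessThanOrEqualToThreshold",
}


def alarm_parameters(alarm_name, namespace, metric_name, statistic,
                     comparator, threshold, period_minutes, consecutive):
    """Build PutMetricAlarm-style parameters from a table-style trigger condition."""
    return {
        "AlarmName": alarm_name,
        "Namespace": namespace,
        "MetricName": metric_name,
        "Statistic": statistic,
        "ComparisonOperator": COMPARISON_OPERATORS[comparator],
        "Threshold": float(threshold),
        "Period": period_minutes * 60,     # CloudWatch periods are in seconds
        "EvaluationPeriods": consecutive,  # number of consecutive periods
    }


# RDS row: "Average CPU utilization > 75% for 15 mins, 2 consecutive times."
params = alarm_parameters(
    alarm_name="example-rds-cpu",  # hypothetical alarm name
    namespace="AWS/RDS",
    metric_name="CPUUtilization",
    statistic="Average",
    comparator=">",
    threshold=75,
    period_minutes=15,
    consecutive=2,
)
print(params)
```

A real alarm would also need dimensions (such as the DB instance identifier) and alarm actions, which are omitted here because the table does not specify them.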

Source: AMS Advanced User Guide (pages 112-121)