Troubleshooting cluster deployment issues

Logs are a useful resource for troubleshooting issues. If you want to delete a failing cluster, you should ﬁrst create an archive of cluster logs. Follow the following steps to start this process.

It is possible to archive the logs in S3 or in a local ﬁle (depending on the --output-file parameter).

$ pcluster export-cluster-logs --cluster-name mycluster --region eu-west-1 \ --bucket bucketname --bucket-prefix logs

{ "url": "https://bucketname.s3.eu-west-1.amazonaws.com/export-log/mycluster-logs-202109071136.tar.gz?..."

}

# use the --output-file parameter to save the logs locally

$ pcluster export-cluster-logs --cluster-name mycluster --region eu-west-1 \ --bucket bucketname --bucket-prefix logs --output-file /tmp/archive.tar.gz { "path": "/tmp/archive.tar.gz"

}

The archive contains the Amazon CloudWatch Logs streams and AWS CloudFormation stack events from the Head Node and Compute Nodes for the last 14 days, unless speciﬁed explicitly in the conﬁguration or in the parameters for the export-cluster-logs command. Please note take the command execution may to run depending on the number of nodes in the cluster and log streams available in CloudWatch Logs. For more information about the available log streams, see Integration with Amazon CloudWatch Logs (p. 251).

Starting from 3.0.0, AWS ParallelCluster preserves CloudWatch Logs by default when a cluster is deleted. If you want to delete a cluster and preserve its logs, make sure that Monitoring > Logs >

CloudWatch > DeletionPolicy isn't set to Delete in the cluster conﬁguration. Otherwise, change the value for this ﬁeld to Retain and run the pcluster update-cluster command. Then, run pcluster delete-cluster --cluster-name <cluster_name> to delete the cluster yet retain the log group that’s stored in Amazon CloudWatch.

If compute nodes self terminate after launching and there aren't any compute node logs for it in CloudWatch, submit a job that triggers a cluster scaling action. Wait for the instance to fail and retrieve the instance console log.

parallelcluster/slurm_resume.log ﬁle.

Retrieve the instance console log with the following command.

aws ec2 get-console-output --instance-id i-abcdef01234567890

The console output log can help you debug the root cause of a compute node failure when the compute node log isn't available.

Troubleshooting cluster deployment issues

If your cluster fails to be created and rolls back stack creation, you can look through the log ﬁles to diagnose the issue. The failure message likely looks like the following output:

$ pcluster create-cluster --cluster-name mycluster --region eu-west-1 \ --cluster-configuration cluster-config.yaml

{

"cluster": {

"clusterName": "mycluster",

"cloudformationStackStatus": "CREATE_IN_PROGRESS",

"cloudformationStackArn": "arn:aws:cloudformation:eu-west-1:xxx:stack/

mycluster/1bf6e7c0-0f01-11ec-a3b9-024fcc6f3387",

Troubleshooting cluster deployment issues

"region": "eu-west-1", "version": "3.1.2",

"clusterStatus": "CREATE_IN_PROGRESS"

} }

$ pcluster describe-cluster --cluster-name mycluster --region eu-west-1 { "creationTime": "2021-09-06T11:03:47.696Z",

...

To diagnose the cluster creation issue you can use the pcluster get-cluster-stack-events (p. 86) command, by ﬁltering for CREATE_FAILED status. For more information, see Filtering AWS CLI output in the AWS Command Line Interface User Guide.

$ pcluster get-cluster-stack-events --cluster-name mycluster --region eu-west-1 \ --query 'events[?resourceStatus==`CREATE_FAILED`]'

[

{ "eventId": "3ccdedd0-0f03-11ec-8c06-02c352fe2ef9",

"physicalResourceId": "arn:aws:cloudformation:eu-west-1:xxx:stack/

mycluster/1bf6e7c0-0f02-11ec-a3b9-024fcc6f3387", "resourceStatus": "CREATE_FAILED",

"resourceStatusReason": "The following resource(s) failed to create: [HeadNode]. ", "stackId": "arn:aws:cloudformation:eu-west-1:xxx:stack/mycluster/1bf6e7c0-0f02-11ec-a3b9-024fcc6f3387",

"stackName": "mycluster",

"logicalResourceId": "mycluster",

"resourceType": "AWS::CloudFormation::Stack", "timestamp": "2021-09-06T11:11:51.780Z"

{ "eventId": "HeadNode-CREATE_FAILED-2021-09-06T11:11:50.127Z", "physicalResourceId": "i-04e91cc1f4ea796fe",

"resourceStatus": "CREATE_FAILED",

"resourceStatusReason": "Received FAILURE signal with UniqueId i-04e91cc1f4ea796fe", "resourceProperties": "{\"LaunchTemplate\":{\"Version\":\"1\",\"LaunchTemplateId\":

\"lt-057d2b1e687f05a62\"}}",

"stackId": "arn:aws:cloudformation:eu-west-1:xxx:stack/mycluster/1bf6e7c0-0f02-11ec-a3b9-024fcc6f3387",

"stackName": "mycluster",

"logicalResourceId": "HeadNode", "resourceType": "AWS::EC2::Instance", "timestamp": "2021-09-06T11:11:50.127Z"

}]

In the previous example, the failure was in the HeadNode setup. To debug this kind of issue, you can list the log streams available from the head node with the pcluster list-cluster-log-streams (p. 88) by ﬁltering for node-type, and then analyzing the log list-cluster-log-streams content.

$ pcluster list-cluster-log-streams --cluster-name mycluster --region eu-west-1 \ --filters 'Name=node-type,Values=HeadNode'

{

Troubleshooting cluster deployment issues

"logStreams": [ {

"logStreamArn": "arn:aws:logs:eu-west-1:xxx:log-group:/aws/parallelcluster/

mycluster-202109061103:log-stream:ip-10-0-0-13.i-04e91cc1f4ea796fe.cfn-init", "logStreamName": "ip-10-0-0-13.i-04e91cc1f4ea796fe.cfn-init",

...

}, {

"logStreamArn": "arn:aws:logs:eu-west-1:xxx:log-group:/aws/parallelcluster/

mycluster-202109061103:log-stream:ip-10-0-0-13.i-04e91cc1f4ea796fe.chef-client", "logStreamName": "ip-10-0-0-13.i-04e91cc1f4ea796fe.chef-client",

...

}, {

"logStreamArn": "arn:aws:logs:eu-west-1:xxx:log-group:/aws/parallelcluster/

mycluster-202109061103:log-stream:ip-10-0-0-13.i-04e91cc1f4ea796fe.cloud-init", "logStreamName": "ip-10-0-0-13.i-04e91cc1f4ea796fe.cloud-init",

...

}, ...

] }

The two primary log streams that you can use to pinpoint initialization errors are the following:

• cfn-init is the log for the cfn-init script. First check this log stream. You're likely to see the Command chef failed error in this log. Look at the lines immediately before this line for more speciﬁcs connected with the error message. For more information, see cfn-init.

• cloud-init is the log for cloud-init. If you don't see anything in cfn-init, then try checking this log next.

You can retrieve the content of the log stream by using the pcluster get-cluster-log-events (p. 85) (note the --limit 5 option to limit the number of retrieved get-cluster-log-events):

$ pcluster get-cluster-log-events --cluster-name mycluster \

--region eu-west-1 --log-stream-name ip-10-0-0-13.i-04e91cc1f4ea796fe.cfn-init \ --limit 5

{ "nextToken": "f/36370880979637159565202782352491087067973952362220945409/s", "prevToken": "b/36370880752972385367337528725601470541902663176996585497/s", "events": [

{

"message": "2021-09-06 11:11:39,049 [ERROR] Unhandled exception during build: Command runpostinstall failed",

"timestamp": "2021-09-06T11:11:39.049Z"

}, {

"message": "Traceback (most recent call last):\n File \"/opt/aws/bin/cfn-init

\", line 176, in <module>\n worklog.build(metadata, configSets)\n File \"/usr/

lib/python3.7/site-packages/cfnbootstrap/construction.py\", line 135, in build\n Contractor(metadata).build(configSets, self)\n File \"/usr/lib/python3.7/site-packages/

cfnbootstrap/construction.py\", line 561, in build\n self.run_config(config, worklog)\n File \"/usr/lib/python3.7/site-packages/cfnbootstrap/construction.py\", line 573, in run_config\n CloudFormationCarpenter(config, self._auth_config).build(worklog)\n File \"/usr/lib/python3.7/site-packages/cfnbootstrap/construction.py\", line 273, in build

\n self._config.commands)\n File \"/usr/lib/python3.7/site-packages/cfnbootstrap/

command_tool.py\", line 127, in apply\n raise ToolError(u\"Command %s failed\" % name)", "timestamp": "2021-09-06T11:11:39.049Z"

}, {

"message": "cfnbootstrap.construction_errors.ToolError: Command runpostinstall failed",

在文檔中 AWS ParallelCluster (頁 172-175)