AWS-ETL-Workflows

Examples of architectures using a number of database, compute, and ETL services in AWS.

Step Function for Running Glue Job

Before running this example, create a Glue job based on the following script, following the instructions here. For the worker type, choose ‘Standard’ and set the number of workers to 4.

You will also need to create an IAM role and attach the following policies:

  • arn:aws:iam::aws:policy/AmazonS3FullAccess
  • arn:aws:iam::aws:policy/AWSGlueConsoleFullAccess
  • arn:aws:iam::aws:policy/CloudWatchLogsFullAccess

This example uses the definition to start the Glue job, wait for the job to complete, query an Athena table, and finally publish the number of rows to SNS.
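The definition itself is linked rather than shown on this page, so the sketch below is only a rough orientation for what a four-step state machine of this kind might look like. The state name `Glue StartJobRun` and the `JobName.$` field appear in the error output later on this page; everything else (the file name `etl-demo-definition.json`, the Athena and SNS state names, the query, the workgroup, and every `<placeholder>`) is an assumption. The sketch uses the `.sync` service integrations so that Step Functions waits for the Glue job and the Athena query to finish; the actual definition may poll for completion instead.

```shell
# Write a sketch of the state machine definition to a local file.
# Hypothetical: file name, Athena/SNS state names, query, and all <placeholders>.
# The quoted 'EOF' keeps the shell from touching the JSONPath "$" characters.
cat > etl-demo-definition.json <<'EOF'
{
  "Comment": "Sketch: run a Glue job, count rows with Athena, publish the count to SNS",
  "StartAt": "Glue StartJobRun",
  "States": {
    "Glue StartJobRun": {
      "Type": "Task",
      "Resource": "arn:aws:states:::glue:startJobRun.sync",
      "Parameters": { "JobName.$": "$.JobName" },
      "Next": "Athena StartQueryExecution"
    },
    "Athena StartQueryExecution": {
      "Type": "Task",
      "Resource": "arn:aws:states:::athena:startQueryExecution.sync",
      "Parameters": {
        "QueryString": "SELECT COUNT(*) FROM <database>.<table>",
        "WorkGroup": "primary",
        "ResultConfiguration": { "OutputLocation": "s3://<bucket>/athena-results/" }
      },
      "Next": "Athena GetQueryResults"
    },
    "Athena GetQueryResults": {
      "Type": "Task",
      "Resource": "arn:aws:states:::athena:getQueryResults",
      "Parameters": { "QueryExecutionId.$": "$.QueryExecution.QueryExecutionId" },
      "Next": "SNS Publish"
    },
    "SNS Publish": {
      "Type": "Task",
      "Resource": "arn:aws:states:::sns:publish",
      "Parameters": {
        "TopicArn": "<topic-arn>",
        "Message.$": "$.ResultSet.Rows[1].Data[0].VarCharValue"
      },
      "End": true
    }
  }
}
EOF
```

In this sketch the SNS message reads `Rows[1]` because the first row of an Athena result set holds the column headers.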

Once we have created the Glue job, we can create the state machine from this definition by running the command listed here and supplying the definition as JSON.
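The linked command is not reproduced on this page; a minimal sketch with the AWS CLI could look like the following. The function name, the definition file name, and the role ARN are placeholders, and the role is assumed to be one that Step Functions can assume and that allows it to call the services used in the definition.

```shell
# Sketch: create a state machine from a local JSON definition file.
# Arguments (all supplied by the caller, none taken from this page except the
# name ETLDemo used in the usage note):
#   $1  state machine name
#   $2  path to the definition file
#   $3  ARN of the IAM role that Step Functions will assume
create_etl_state_machine() {
    local name="$1"
    local definition_file="$2"
    local role_arn="$3"

    # file:// tells the AWS CLI to read the parameter value from a local file
    aws stepfunctions create-state-machine \
        --name "$name" \
        --definition "file://${definition_file}" \
        --role-arn "$role_arn"
}
```

For example, `create_etl_state_machine ETLDemo definition.json <role-arn>` would create a state machine named `ETLDemo`, the name that appears in the execution ARN further down.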

Get the state machine ARN by running the following command, which lists the ARNs of your state machines.
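One way to do this with the AWS CLI is sketched below. It assumes the state machine is looked up by name; the function name is hypothetical, and the `--query` expression simply filters the `stateMachines` list returned by `list-state-machines`.

```shell
# Sketch: print the ARN of the state machine with the given name.
#   $1  state machine name (for example ETLDemo)
get_state_machine_arn() {
    local name="$1"

    # --query filters the response client-side with a JMESPath expression;
    # --output text prints the bare ARN instead of JSON
    aws stepfunctions list-state-machines \
        --query "stateMachines[?name=='${name}'].stateMachineArn" \
        --output text
}
```

Running `aws stepfunctions list-state-machines` without `--query` prints every state machine in the current region together with its ARN.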

Then execute the state machine using the following command, passing the ARN retrieved in the previous step.

$ aws stepfunctions start-execution --state-machine-arn <arn>

{
    "executionArn": "arn:aws:states:us-east-1:376337229415:execution:ETLDemo:905b2d8e-e659-4e18-ba1f-714882100324",
    "startDate": "2022-04-21T02:37:21.064000+01:00"
}

We can check the status of the execution.

$ aws stepfunctions describe-execution --execution-arn "arn:aws:states:us-east-1:376337229415:execution:ETLDemo:905b2d8e-e659-4e18-ba1f-714882100324"


{
    "executionArn": "<arn>",
    "stateMachineArn": "<arn>",
    "name": "905b2d8e-e659-4e18-ba1f-714882100324",
    "status": "FAILED",
    "startDate": "2022-04-21T02:37:21.064000+01:00",
    "stopDate": "2022-04-21T02:38:18.965000+01:00",
    "input": "{}",
    "inputDetails": {
        "included": true
    },
    "traceHeader": "Root=1-6260b551-db5653e799449c7169fc982b;Sampled=1"
}
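To avoid re-running the status command by hand, the status can be polled until the execution leaves the `RUNNING` state. This is only a sketch: the function name and the 10-second default interval are arbitrary choices, not part of the project.

```shell
# Sketch: poll an execution until it is no longer RUNNING, then print the
# final status (for example SUCCEEDED or FAILED).
#   $1  execution ARN
#   $2  seconds to sleep between polls (optional, defaults to 10)
wait_for_execution() {
    local execution_arn="$1"
    local interval="${2:-10}"
    local status

    while true; do
        # Ask only for the "status" field of the describe-execution response
        status=$(aws stepfunctions describe-execution \
            --execution-arn "$execution_arn" \
            --query 'status' \
            --output text) || return 1

        if [ "$status" != "RUNNING" ]; then
            break
        fi
        sleep "$interval"
    done

    echo "$status"
}
```

For example, `wait_for_execution "<execution-arn>"` blocks until the execution finishes and then prints its status.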

If the execution failed, we can retrieve its history. The command below lists events in reverse order and prints only two items, so we get the most recent event (the one that failed) along with the cause of the failure.

$ aws stepfunctions get-execution-history --execution-arn <enter-arn> --no-include-execution-data --reverse-order --max-items 2

{
    "events": [
        {
            "timestamp": "2022-04-21T02:38:18.965000+01:00",
            "type": "ExecutionFailed",
            "id": 9,
            "previousEventId": 0,
            "executionFailedEventDetails": {
                "error": "States.Runtime",
                "cause": "An error occurred while executing the state 'Glue StartJobRun' (entered at the event id #8). The JSONPath '$.JobName' specified for the field 'JobName.$' could not be found in the input '{}'"
            }
        },
        {
            "timestamp": "2022-04-21T02:38:18.965000+01:00",
            "type": "TaskStateEntered",
            "id": 8,
            "previousEventId": 7,
            "stateEnteredEventDetails": {
                "name": "Glue StartJobRun"
            }
        }
    ],
    "NextToken": "eyJuZXh0VG9rZW4iOiBudWxsLCAiYm90b190cnVuY2F0ZV9hbW91bnQiOiAyfQ=="
}

This error occurs because the input JSON {"JobName": "flights_s3_to_s3"} was not passed via the --input argument. The ‘Glue StartJobRun’ task reads the JSONPath ‘$.JobName’ from the input, as defined in the state machine definition.
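A corrected invocation passes the job name through `--input`. The sketch below wraps it in a small helper; the function name is hypothetical, and the job name is assumed to contain no quotes or backslashes, since it is interpolated directly into the JSON string.

```shell
# Sketch: start an execution and pass the Glue job name as JSON input.
#   $1  state machine ARN
#   $2  Glue job name (assumed to need no JSON escaping)
start_etl_execution() {
    local state_machine_arn="$1"
    local job_name="$2"

    # The state machine reads $.JobName from this input document
    aws stepfunctions start-execution \
        --state-machine-arn "$state_machine_arn" \
        --input "{\"JobName\": \"${job_name}\"}"
}
```

For example, `start_etl_execution "<arn>" "flights_s3_to_s3"` is equivalent to running `aws stepfunctions start-execution --state-machine-arn <arn> --input '{"JobName": "flights_s3_to_s3"}'`.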