sdlf-pipeline
Note
sdlf-pipeline is defined in the sdlf-pipeline folder of the SDLF repository.
Infrastructure

An SDLF pipeline is a logical construct representing an ETL process. A team can implement one or more pipelines depending on their needs.
Each pipeline is divided into stages (e.g. StageA, StageB...), each of which maps to an AWS Step Functions state machine. Each state machine orchestrates the process of transforming and moving data between areas of the data lake (e.g. from the RAW to the STAGING area). There are two main advantages to using Step Functions as an orchestration layer: it is 1) serverless and 2) integrated with many other AWS services, which simplifies integration with the rest of the platform. A pipeline can define as many stages as necessary, and stages can be modified over time.
Each state machine comprises one or more steps corresponding to operations in the orchestration process (e.g. starting an analytical job, running a crawler...).
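The relationship between pipelines, stages and steps can be pictured as a small data model. The sketch below is purely illustrative: the class names `Pipeline` and `Stage`, the team and pipeline names, and the step labels are assumptions made for this example and are not part of SDLF itself.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Stage:
    """One stage of a pipeline; in SDLF a stage maps to a Step Functions state machine."""
    name: str
    steps: List[str] = field(default_factory=list)


@dataclass
class Pipeline:
    """A logical ETL process owned by a team, made of ordered stages."""
    team: str
    name: str
    stages: List[Stage] = field(default_factory=list)

    def stage_names(self) -> List[str]:
        """Return the stage names in execution order."""
        return [stage.name for stage in self.stages]


# Hypothetical two-stage pipeline using the StageA/StageB naming from the text.
example = Pipeline(
    team="engineering",
    name="main",
    stages=[
        Stage("StageA", ["update object metadata", "light transformation"]),
        Stage("StageB", ["heavy transformation", "run crawler"]),
    ],
)
```

Adding, removing or reordering entries in `stages` mirrors the flexibility a team has when shaping its own pipeline.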

An example architecture for an SDLF pipeline is detailed in the diagram above. The entire process is event-driven.
It's important to understand that this is just one example used to illustrate the orchestration process within the framework. Each team has full flexibility in terms of the number, order and purpose of the various stages and steps within their pipeline.
- As a file lands in the RAW bucket under the `{team}/{dataset}` prefix, an S3 Event Notification is created and placed in a queue.
- The first stage of the pipeline (i.e. the StageA state machine) is triggered. The initial step is to update the Objects Metadata Catalog DynamoDB table (i.e. file metadata) with details about the landed object (S3 path, timestamp…), before a light transformation is applied. The code for this light transformation would have previously been pushed into a CodeCommit repository by the data engineer, and potentially gone through code review and testing before entering production. The final step is to update the object metadata catalog with the output of the transformation and send the messages to the next SQS queue.
- Every 5 minutes (customizable), an EventBridge rule fires a Lambda function which checks whether there are messages in the queue sent from the previous stage. If so, it triggers the second state machine (StageB).
- This time a heavy transformation is applied to a batch of files. This heavy transformation can be an API call to an analytical AWS service (Glue job, Fargate task, EMR step, SageMaker notebook…), and the code is again provided by the data engineer. The state machine waits for the job to reach a SUCCEEDED state before the output is crawled to update the Glue Metadata Catalog (i.e. table metadata). A data quality step leveraging Glue Data Quality can also be run.
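The scheduled check between StageA and StageB can be sketched roughly as follows. This is a minimal illustration, not the routing Lambda that ships with SDLF: the function name `route_pending_messages`, its parameters and the shape of the execution input are assumptions. The `sqs` and `sfn` arguments are expected to be boto3 clients, e.g. `boto3.client("sqs")` and `boto3.client("stepfunctions")`.

```python
import json


def route_pending_messages(sqs, sfn, queue_url, state_machine_arn):
    """Start the next stage if its queue holds messages.

    Returns the execution ARN, or None when the queue is empty.
    """
    attrs = sqs.get_queue_attributes(
        QueueUrl=queue_url,
        AttributeNames=["ApproximateNumberOfMessages"],
    )
    # SQS returns attribute values as strings, and the count is approximate.
    pending = int(attrs["Attributes"]["ApproximateNumberOfMessages"])
    if pending == 0:
        return None  # nothing arrived from the previous stage

    # Hypothetical input payload; a real stage would pass the batch of objects.
    response = sfn.start_execution(
        stateMachineArn=state_machine_arn,
        input=json.dumps({"pending_messages": pending}),
    )
    return response["executionArn"]
```

Passing the clients in as arguments keeps the function easy to exercise with stand-in objects, without AWS credentials.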
Usage
sdlf-pipeline is not meant to be used on its own. It is intended to be included in stages (such as sdlf-stage-lambda) so that all stages share a common interface (EventBridge as trigger).
CloudFormation with sdlf-cicd
Read the official SDLF workshop for an end-to-end deployment example.
rPipelineInterface:
  Type: awslabs::sdlf::pipeline::MODULE
  Properties:
    pPipelineReference: !Ref pPipelineReference
    pOrg: !Ref pOrg
    pDomain: !Ref pDomain
    pEnv: !Ref pEnv
    pTeamName: !Ref pTeamName
    pPipelineName: !Ref pPipeline
    pStageName: !Ref pStageName
    pStageEnabled: !Ref pStageEnabled
    pTriggerType: !Ref pTriggerType
    pSchedule: !Ref pSchedule
    pEventPattern: !Ref pEventPattern
    pLambdaRoutingStep: !GetAtt rLambdaRoutingStep.Arn
Interface
Interfacing with other modules is done through SSM Parameters. sdlf-pipeline publishes the following parameters:
| SSM Parameter | Description | Comment |
|---|---|---|
| `/SDLF/SQS/{team}/{pipeline}{stage}Queue` | Name of the SQS queue | |
| `/SDLF/SQS/{team}/{pipeline}{stage}DLQ` | Name of the SQS dead-letter queue | |
| `/SDLF/Pipelines/{team}/{pipeline}/{stage}` | placeholder | |
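A consuming module could resolve these parameters along the following lines. This is a sketch under stated assumptions: the helper names `sdlf_pipeline_parameters` and `read_queue_name` are hypothetical, and `ssm` is expected to be a boto3 SSM client such as `boto3.client("ssm")`. Only the parameter name patterns come from the table above.

```python
def sdlf_pipeline_parameters(team, pipeline, stage):
    """Return the SSM parameter names published for one stage of a pipeline."""
    return {
        "queue": f"/SDLF/SQS/{team}/{pipeline}{stage}Queue",
        "dlq": f"/SDLF/SQS/{team}/{pipeline}{stage}DLQ",
        "pipeline": f"/SDLF/Pipelines/{team}/{pipeline}/{stage}",
    }


def read_queue_name(ssm, team, pipeline, stage):
    """Look up the name of the stage's SQS queue from SSM Parameter Store."""
    name = sdlf_pipeline_parameters(team, pipeline, stage)["queue"]
    return ssm.get_parameter(Name=name)["Parameter"]["Value"]
```

For example, with the hypothetical values `team="engineering"`, `pipeline="main"` and `stage="a"`, the queue parameter name resolves to `/SDLF/SQS/engineering/mainaQueue`.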