sdlf-foundations

Note

sdlf-foundations is defined in the sdlf-foundations folder of the SDLF repository.

Infrastructure

SDLF Foundations

sdlf-foundations contains, as the name implies, foundational resources of a data lake. Data in a data lake can be broadly categorized across three distinct layers, with dedicated buckets:

  • Raw: All data ingested from various data sources into the data lake lands in the raw bucket, in its original format (raw). This can include structured, semi-structured, and unstructured data objects such as databases, backups, archives, JSON, CSV, XML, text files, or images.
  • Stage: After raw data has been transformed or normalized through data pipelines, it is stored in the stage bucket (also called transformed). At this stage, data can be converted into columnar formats such as Apache Parquet and Apache ORC, which Amazon Athena can query.
  • Analytics: Transformed data can be further enriched by blending it with other datasets to provide additional insights. This layer typically contains S3 objects optimized for analytics, for reporting with Amazon Athena or Amazon Redshift Spectrum, and for loading into massively parallel processing data warehouses such as Amazon Redshift.

An important building block of a data lake on AWS is AWS Lake Formation. It allows managing fine-grained data access permissions and sharing data. sdlf-foundations enables Lake Formation and registers the buckets above so that they can be managed through it.

Warning

In an existing environment where IAM and resource policies are used extensively to manage permissions, or Lake Formation already has data lake administrators defined, sdlf-foundations can have unintended consequences. Please review the official documentation on Lake Formation settings.
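To make the registration step more concrete, here is a minimal CloudFormation sketch of registering a single bucket with Lake Formation. It is not taken from the sdlf-foundations template: the parameter names pBucketName and pDataAccessRoleArn and the logical ID rExampleRegistration are hypothetical, and the actual module may configure this differently.

```yaml
AWSTemplateFormatVersion: "2010-09-09"
Description: Hypothetical sketch of registering one bucket with Lake Formation

Parameters:
    pBucketName:            # hypothetical: name of an existing S3 bucket
        Type: String
    pDataAccessRoleArn:     # hypothetical: IAM role Lake Formation assumes to access the bucket
        Type: String

Resources:
    rExampleRegistration:   # hypothetical logical ID
        Type: AWS::LakeFormation::Resource
        Properties:
            # Register the whole bucket as a Lake Formation data location
            ResourceArn: !Sub arn:${AWS::Partition}:s3:::${pBucketName}
            RoleArn: !Ref pDataAccessRoleArn
            # Use the role above rather than the Lake Formation service-linked role
            UseServiceLinkedRole: false
```

Once a location is registered, access to data under it can be granted through Lake Formation permissions, which is why the warning above matters in accounts that already rely on IAM and resource policies.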

In addition to the above, sdlf-foundations creates three more buckets useful for data operations:

  • Artifacts to store artifacts such as Glue job scripts and Lambda function source files.
  • Athena to hold Athena query results.
  • Logs to receive the S3 access logs of the other buckets. It can also be used as a target for other types of logs produced by data pipelines if needed.

All buckets are encrypted using SSE-KMS, with S3 Bucket Keys enabled.
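For readers unfamiliar with this setting, the following is a minimal sketch of how SSE-KMS with S3 Bucket Keys is typically declared as default bucket encryption in CloudFormation. The logical ID rExampleBucket and the parameter pKmsKeyArn are hypothetical, and this is not a copy of the sdlf-foundations template.

```yaml
AWSTemplateFormatVersion: "2010-09-09"
Description: Hypothetical sketch of a bucket using SSE-KMS with S3 Bucket Keys

Parameters:
    pKmsKeyArn:             # hypothetical: ARN of the KMS key used for encryption
        Type: String

Resources:
    rExampleBucket:         # hypothetical logical ID
        Type: AWS::S3::Bucket
        Properties:
            BucketEncryption:
                ServerSideEncryptionConfiguration:
                    - ServerSideEncryptionByDefault:
                          SSEAlgorithm: aws:kms
                          KMSMasterKeyID: !Ref pKmsKeyArn
                      # Bucket Keys reduce the number of requests S3 makes to KMS
                      BucketKeyEnabled: true
```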

sdlf-foundations also puts in place a number of DynamoDB tables:

| Table | Description | SSM Parameter |
| octagon-ObjectMetadata-{environment}{customSuffix} | Metadata for all objects in the data lake raw & stage buckets (populated by sdlf-catalog) | /SDLF2/Dynamo/ObjectCatalog |
| octagon-Datasets-{environment}{customSuffix} | Metadata for all user-defined datasets (populated by sdlf-dataset) | /SDLF2/Dynamo/TransformMapping, /SDLF2/Dynamo/Datasets |
| octagon-Artifacts-{environment}{customSuffix} | Metadata for all artifacts (currently empty, soon populated by sdlf-cicd) | |
| octagon-Metrics-{environment}{customSuffix} | User-provided data pipeline metrics | |
| octagon-Configuration-{environment}{customSuffix} | User-provided key-value configurations | |
| octagon-Teams-{environment}{customSuffix} | Metadata for all data teams and notification topics (populated by sdlf-team) | /SDLF2/Dynamo/TeamMetadata |
| octagon-Pipelines-{environment}{customSuffix} | Metadata for all data pipeline stages (populated by sdlf-team) | /SDLF2/Dynamo/Pipelines |
| octagon-Events-{environment}{customSuffix} | Logging (unused) | |
| octagon-PipelineExecutionHistory-{environment}{customSuffix} | Track pipeline execution progress and history (populated by sdlf-stageA, sdlf-stageB) | |
| octagon-DataSchemas-{environment}{customSuffix} | Structure of all datasets (populated by sdlf-replicate, based on Glue catalog) | /SDLF2/Dynamo/DataSchemas |
| octagon-Manifests-{environment}{customSuffix} | Track manifests and files for manifest-file-based processing | /SDLF2/Dynamo/Manifests |

The sdlf-catalog{customSuffix} Lambda function populates octagon-ObjectMetadata automatically whenever an object is uploaded to or deleted from the raw and stage buckets. The sdlf-replicate{customSuffix} Lambda function copies into octagon-DataSchemas the schema of any Glue database that is newly created or whose schema is updated.

SSM parameters holding names or ARNs are created for all resources that may be used by other modules.

Warning

The data lake admin team should be the only one with write access to the sdlf-foundations code base, as it can impact an entire data lake environment including data pipelines within it.

Usage

CloudFormation with sdlf-cicd

Read the official SDLF workshop for an end-to-end deployment example.

rExample:
    Type: awslabs::sdlf::foundations::MODULE
    Properties:
        pPipelineReference: !Ref pPipelineReference
        pChildAccountId: 111111111111
        pOrg: forecourt
        pDomain: proserve
        pEnvironment: dev

Interface

Interfacing with other modules is done through SSM Parameters. sdlf-foundations publishes the following parameters:

| SSM Parameter | Description | Comment |
| /SDLF2/Dynamo/ObjectCatalog | Name of the DynamoDB table used to store metadata | |
| /SDLF2/Dynamo/TransformMapping | Name of the DynamoDB table used to store mappings to transformation | |
| /SDLF2/Dynamo/Pipelines | Name of the DynamoDB table used to store pipelines metadata | |
| /SDLF2/Dynamo/TeamMetadata | Name of the DynamoDB table used to store teams metadata | |
| /SDLF2/Dynamo/DataSchemas | Name of the DynamoDB table used to store data schemas | |
| /SDLF2/Dynamo/Manifests | Name of the DynamoDB table used to store manifest process metadata | |
| /SDLF2/IAM/LakeFormationDataAccessRole | Lake Formation Data Access Role name | |
| /SDLF2/IAM/LakeFormationDataAccessRoleArn | Lake Formation Data Access Role ARN | |
| /SDLF2/IAM/DataLakeAdminRoleArn | Lake Formation Data Lake Admin Role ARN | |
| /SDLF2/KMS/KeyArn | ARN of the default KMS key used for encrypting data lake buckets | |
| /SDLF2/Misc/pOrg | Name of the Organization owning the datalake | |
| /SDLF2/Misc/pDomain | Data domain name | |
| /SDLF2/Misc/pEnv | Environment name | |
| /SDLF2/S3/AccessLogsBucket | Access Logs S3 bucket name | |
| /SDLF2/S3/ArtifactsBucket | Artifacts S3 bucket name | |
| /SDLF2/S3/CentralBucket | Central S3 bucket name | deprecated, use /SDLF2/S3/RawBucket instead |
| /SDLF2/S3/RawBucket | Raw S3 bucket name | |
| /SDLF2/S3/StageBucket | Stage S3 bucket name | |
| /SDLF2/S3/AnalyticsBucket | Analytics S3 bucket name | |
| /SDLF2/S3/AthenaBucket | Athena query results S3 bucket name | |
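
As an illustration of how another template might consume these parameters, here is a minimal sketch that resolves two of them through CloudFormation SSM parameter types and uses them in an IAM policy. The parameter names pStageBucket and pKmsKeyArn, the logical ID rExampleReadPolicy, and the policy itself are hypothetical; only the SSM parameter paths come from the table above.

```yaml
AWSTemplateFormatVersion: "2010-09-09"
Description: Hypothetical sketch of a template consuming sdlf-foundations SSM parameters

Parameters:
    pStageBucket:           # resolved at deploy time to the stage bucket name
        Type: AWS::SSM::Parameter::Value<String>
        Default: /SDLF2/S3/StageBucket
    pKmsKeyArn:             # resolved at deploy time to the data lake KMS key ARN
        Type: AWS::SSM::Parameter::Value<String>
        Default: /SDLF2/KMS/KeyArn

Resources:
    rExampleReadPolicy:     # hypothetical logical ID
        Type: AWS::IAM::ManagedPolicy
        Properties:
            PolicyDocument:
                Version: "2012-10-17"
                Statement:
                    # Read objects from the stage bucket
                    - Effect: Allow
                      Action:
                          - s3:GetObject
                      Resource: !Sub arn:${AWS::Partition}:s3:::${pStageBucket}/*
                    # Decrypt objects encrypted with the data lake key
                    - Effect: Allow
                      Action:
                          - kms:Decrypt
                      Resource: !Ref pKmsKeyArn
```

Resolving names through SSM keeps the consuming template free of hard-coded bucket names and ARNs, so the same template can be deployed unchanged in each environment where sdlf-foundations is present.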