sdlf-foundations

Note

sdlf-foundations is defined in the sdlf-foundations folder of the SDLF repository.

Infrastructure

SDLF Foundations

sdlf-foundations contains, as the name implies, foundational resources of a data lake. Data in a data lake can be broadly categorized across three distinct layers, with dedicated buckets:

Raw: All data ingested from various data sources into the data lake lands in the raw bucket, in their original data format (raw). This can include structured, semi-structured, and unstructured data objects such as databases, backups, archives, JSON, CSV, XML, text files, or images.
Stage: After raw data have been transformed or normalized through data pipelines, it is stored in a staging bucket (also called transformed). In this stage, data can be transformed into columnar data formats such as Apache Parquet and Apache ORC. These formats can be used by Amazon Athena.
Analytics: Transformed data can be further enriched by blending other datasets to provide additional insights. This layer typically contains S3 objects which are optimized for analytics, reporting using Amazon Athena, Amazon Redshift Spectrum, and loading into massively-parallel processing data warehouses such as Amazon Redshift.

An important building block of a data lake on AWS is AWS Lake Formation. It allows managing fine-grained data access permissions and share data. sdlf-foundations enables Lake Formation and register the previous buckets so that they can be managed through it.

Warning

In an existing environment where IAM and resource policies are used extensively to manage permissions, or Lake Formation already has data lake administrators defined, sdlf-foundations can have unintended consequences. Please review the official documentation on Lake Formation settings.

In addition to the above, sdlf-foundations creates three more buckets useful for data operations:

Artifacts to store artifacts such as Glue job scripts and Lambda function source files.
Athena to hold Athena query results.
Logs for S3 access logs to the other buckets. It can also be used as a target for other types of logs produced by data pipelines if needed.

Everything is encrypted using SSE-KMS with S3 Bucket Keys.

sdlf-foundations also puts in place a number of DynamoDB tables:

Table	Description	SSM Parameter
`octagon-ObjectMetadata-{environment}{customSuffix}`	Metadata for all objects in the data lake raw & stage buckets (populated by `sdlf-catalog`)	`/SDLF2/Dynamo/ObjectCatalog`
`octagon-Datasets-{environment}{customSuffix}`	Metadata for all user-defined datasets (populated by `sdlf-dataset`)	`/SDLF2/Dynamo/TransformMapping`, `/SDLF2/Dynamo/Datasets`
`octagon-Artifacts-{environment}{customSuffix}`	Metadata for all artifacts (currently empty, soon populated by `sdlf-cicd`)
`octagon-Metrics-{environment}{customSuffix}`	User-provided data pipeline metrics
`octagon-Configuration-{environment}{customSuffix}`	User-provided key-value configurations
`octagon-Teams-{environment}{customSuffix}`	Metadata for all data teams and notification topics (populated by `sdlf-team`)	`/SDLF2/Dynamo/TeamMetadata`
`octagon-Pipelines-{environment}{customSuffix}`	Metadata for all data pipeline stages (populated by `sdlf-team`)	`/SDLF2/Dynamo/Pipelines`
`octagon-Events-{environment}{customSuffix}`	Logging (unused)
`octagon-PipelineExecutionHistory-{environment}{customSuffix}`	Track pipeline execution progress and history (populated by `sdlf-stageA`, `sdlf-stageB`)
`octagon-DataSchemas-{environment}{customSuffix}`	Structure of all datasets (populated by `sdlf-replicate`, based on Glue catalog)	`/SDLF2/Dynamo/DataSchemas`
`octagon-Manifests-{environment}{customSuffix}`	Track manifests and files for manifest-file-based processing	`/SDLF2/Dynamo/Manifests`

The sdlf-catalog{customSuffix} Lambda function is used to populate octagon-ObjectMetadata automatically whenever a new object is uploaded or deleted from the raw and stage buckets. The sdlf-replicate{customSuffix} Lambda function copies the schema of any new Glue database or Glue database whose schema is updated.

SSM parameters holding names or ARNs are created for all resources that may be used by other modules.

Warning

The data lake admin team should be the only one with write access to the sdlf-foundations code base, as it can impact an entire data lake environment including data pipelines within it.

Usage

CloudFormation with sdlf-cicd

Read the official SDLF workshop for an end-to-end deployment example.

rExample:
    Type: awslabs::sdlf::foundations::MODULE
    Properties:
        pPipelineReference: !Ref pPipelineReference
        pChildAccountId: 111111111111
        pOrg: forecourt
        pDomain: proserve
        pEnvironment: dev

Interface

Interfacing with other modules is done through SSM Parameters. sdlf-foundations publishes the following parameters:

SSM Parameter	Description	Comment
`/SDLF2/Dynamo/ObjectCatalog`	Name of the DynamoDB used to store metadata
`/SDLF2/Dynamo/TransformMapping`	Name of the DynamoDB used to store mappings to transformation
`/SDLF2/Dynamo/Pipelines`	Name of the DynamoDB used to store pipelines metadata
`/SDLF2/Dynamo/TeamMetadata`	Name of the DynamoDB used to store teams metadata
`/SDLF2/Dynamo/DataSchemas`	Name of the DynamoDB used to store data schemas
`/SDLF2/Dynamo/Manifests`	Name of the DynamoDB used to store manifest process metadata
`/SDLF2/IAM/LakeFormationDataAccessRole`	Lake Formation Data Access Role name
`/SDLF2/IAM/LakeFormationDataAccessRoleArn`	Lake Formation Data Access Role ARN
`/SDLF2/IAM/DataLakeAdminRoleArn`	Lake Formation Data Lake Admin Role ARN
`/SDLF2/KMS/KeyArn`	ARN of the default KMS key used for encrypting data lake buckets
`/SDLF2/Misc/pOrg`	Name of the Organization owning the datalake
`/SDLF2/Misc/pDomain`	Data domain name
`/SDLF2/Misc/pEnv`	Environment name
`/SDLF2/S3/AccessLogsBucket`	Access Logs S3 bucket name
`/SDLF2/S3/ArtifactsBucket`	Artifacts S3 bucket name
`/SDLF2/S3/CentralBucket`	Central S3 bucket name	deprecated, use `/SDLF2/S3/RawBucket` instead
`/SDLF2/S3/RawBucket`	Raw S3 bucket name
`/SDLF2/S3/StageBucket`	Stage S3 bucket name
`/SDLF2/S3/AnalyticsBucket`	Analytics S3 bucket name
`/SDLF2/S3/AthenaBucket`	Athena query results S3 bucket name