sdlf-dataset

Note

sdlf-dataset is defined in the sdlf-dataset folder of the SDLF repository.

Infrastructure

SDLF Dataset

An SDLF dataset is a logical construct referring to a grouping of data. It can be anything from a single table to an entire database with multiple tables. As a general good practice, keep the deployed infrastructure to a minimum to avoid unnecessary overhead and cost: the more data is grouped into a single dataset, the better. Abstraction at the transformation code level can then make distinctions within a given dataset.

Examples of datasets are:

A relational database with multiple tables (e.g. Sales DB with orders and customers tables)
A group of files from a data source (e.g. XML files from a Telemetry system)
A streaming data source (e.g. Kinesis data stream batching files and dumping them into S3)

sdlf-dataset creates a Glue database, as well as a Glue crawler.

SSM parameters holding names or ARNs are created for all resources that may be used by other modules.
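
The pattern of publishing a resource name through SSM can be sketched as follows. This is a minimal illustration and not the module's actual template: the logical IDs (`rGlueDataCatalog`, `rGlueDataCatalogSsm`), the database naming scheme, and the assumption that the parameter stores the database name are all hypothetical. Only the parameter path is taken from this page.

```yaml
# Sketch only: logical IDs and the database name format are assumptions.
Parameters:
    pTeamName:
        Type: String
    pDatasetName:
        Type: String

Resources:
    rGlueDataCatalog:
        Type: AWS::Glue::Database
        Properties:
            CatalogId: !Ref AWS::AccountId
            DatabaseInput:
                # Hypothetical naming scheme
                Name: !Sub ${pTeamName}_${pDatasetName}_db

    rGlueDataCatalogSsm:
        Type: AWS::SSM::Parameter
        Properties:
            Name: !Sub /SDLF/Glue/${pTeamName}/${pDatasetName}/DataCatalog
            Type: String
            # !Ref on a Glue database returns its name
            Value: !Ref rGlueDataCatalog
```

Publishing the name under a predictable path lets other stacks look it up without a direct cross-stack export.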

Usage

CloudFormation with sdlf-cicd

See the official SDLF workshop for an end-to-end deployment example.

rExample:
    Type: awslabs::sdlf::dataset::MODULE
    Properties:
        pPipelineReference: !Ref pPipelineReference
        pTeamName: iot
        pDatasetName: legislators

Interface

Interfacing with other modules is done through SSM Parameters. sdlf-dataset publishes the following parameters:

SSM Parameter	Description	Comment
`/SDLF/Datasets/{team}/{dataset}`	Dataset-specific metadata for data pipelines
`/SDLF/Glue/{team}/{dataset}/GlueCrawler`	Team dataset Glue crawler
`/SDLF/Glue/{team}/{dataset}/DataCatalog`	Team dataset metadata catalog
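
A consuming template might read one of these parameters as sketched below. The team and dataset names reuse the `iot` / `legislators` values from the usage example. The parameter name `pGlueCrawlerName`, the resource `rStartCrawlerPolicy`, and the assumption that the SSM parameter holds the crawler name rather than its ARN are hypothetical and should be checked against the deployed values.

```yaml
Parameters:
    # CloudFormation resolves the SSM parameter value at deploy time.
    pGlueCrawlerName:
        Type: AWS::SSM::Parameter::Value<String>
        Default: /SDLF/Glue/iot/legislators/GlueCrawler

Resources:
    # Hypothetical consumer: a policy allowing a role to start the dataset crawler.
    rStartCrawlerPolicy:
        Type: AWS::IAM::ManagedPolicy
        Properties:
            PolicyDocument:
                Version: "2012-10-17"
                Statement:
                    - Effect: Allow
                      Action: glue:StartCrawler
                      # Assumes the parameter value is the crawler name, not an ARN
                      Resource: !Sub arn:${AWS::Partition}:glue:${AWS::Region}:${AWS::AccountId}:crawler/${pGlueCrawlerName}
```

Using the `AWS::SSM::Parameter::Value<String>` parameter type keeps the consuming stack decoupled from the dataset stack: only the parameter path needs to be known.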