Defining scope and constraints for Data Engineering pipelines

When designing a data pipeline, it is important to start by asking the following questions:

1. Requirements and Goals

  • What is the primary goal of this pipeline?
  • What types of data are we processing (structured, semi-structured, unstructured)?
    • Structured data: Data organized in a fixed schema, such as rows in a relational table.
    • Semi-structured: Data that doesn’t follow a rigid schema but still has some organizational markers, such as JSON or XML.
    • Unstructured: Data with no predefined structure that cannot easily fit into rows and columns, such as images or free text.
  • How often does the data need to be available for consumers? (real-time, near-real-time, daily batch)
  • Are there any SLAs for latency or throughput?
    • Latency: The time it takes for a single piece of data to travel through the system from input to output.
    • Throughput: The amount of data a system can process in a given time period.
  • How large is the expected data volume now, and how fast do we expect it to grow?
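The latency/throughput distinction above can be made concrete with a minimal sketch. The `process` function here is a stand-in for a real pipeline stage; the measurement approach, not the transformation, is the point:

```python
import time

def process(record):
    # Stand-in transformation; a real stage would do far more work.
    return record * 2

def run_batch(records):
    """Process a batch and report simple latency and throughput figures."""
    latencies = []
    start = time.perf_counter()
    for record in records:
        t0 = time.perf_counter()
        process(record)
        latencies.append(time.perf_counter() - t0)  # per-record latency
    elapsed = time.perf_counter() - start
    return {
        "avg_latency_s": sum(latencies) / len(latencies),
        "throughput_rps": len(records) / elapsed,  # records per second
    }

stats = run_batch(list(range(10_000)))
print(f"avg latency: {stats['avg_latency_s']:.2e}s, "
      f"throughput: {stats['throughput_rps']:.0f} records/s")
```

Note that optimizing for one metric can hurt the other: larger batches usually raise throughput but increase the latency of any individual record.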

2. Sources and Destinations

  • What are the data sources? (databases, APIs, logs, IoT devices, external systems)
  • Are there multiple sources with different reliability characteristics?
  • Where should the processed data be stored or consumed from? (data warehouse, lake, dashboards, ML models)
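Answers to these questions often end up in a pipeline configuration. A hypothetical sketch (the names, types, and reliability labels are illustrative, not tied to any particular tool):

```python
# Hypothetical pipeline configuration; all names and fields are illustrative.
pipeline_config = {
    "sources": [
        {"name": "orders_db", "type": "postgres", "reliability": "high"},
        {"name": "clickstream", "type": "kafka", "reliability": "best-effort"},
        {"name": "partner_api", "type": "rest", "reliability": "rate-limited"},
    ],
    "destinations": [
        {"name": "warehouse", "type": "bigquery"},
        {"name": "dashboard_cache", "type": "redis"},
    ],
}

# Surfacing reliability differences up front helps decide where retries,
# buffering, or dead-letter queues will be needed.
flaky_sources = [s["name"] for s in pipeline_config["sources"]
                 if s["reliability"] != "high"]
```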

3. Data Processing Expectations

  • Do we need complex transformations, aggregations, or enrichment?
  • Should the pipeline handle streaming, batch, or a mix of both?
  • Are there specific data quality requirements? (e.g., deduplication, schema validation)
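The two data quality checks named above, deduplication and schema validation, can be sketched as plain functions. The required field names are assumptions for illustration:

```python
def validate(record, required_fields=("id", "ts", "value")):
    """Schema validation: reject records missing required fields."""
    return all(field in record for field in required_fields)

def deduplicate(records, key="id"):
    """Deduplication: keep the first record seen for each key."""
    seen, out = set(), []
    for r in records:
        if r[key] not in seen:
            seen.add(r[key])
            out.append(r)
    return out

raw = [
    {"id": 1, "ts": 100, "value": 9.5},
    {"id": 1, "ts": 100, "value": 9.5},   # duplicate of the first record
    {"id": 2, "ts": 101},                 # missing "value": fails validation
    {"id": 3, "ts": 102, "value": 7.0},
]
clean = deduplicate([r for r in raw if validate(r)])
# clean retains only the records with ids 1 and 3
```

In production these checks usually run against a declared schema (and the dedup state lives in durable storage), but the shape of the logic is the same.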

4. Reliability and Fault Tolerance

  • What level of reliability is expected? (exactly-once, at-least-once, at-most-once)
    • At-most-once:
      • Each message is delivered zero or one time.
      • Messages might be lost, but never duplicated.
    • At-least-once:
      • Each message is delivered one or more times.
      • Messages are never lost, but duplicates can occur.
    • Exactly-once:
      • Each message is delivered exactly once.
      • No duplicates, no losses.
      • Achieved with careful transactional processing or idempotent operations.
  • Are there any retention requirements?
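A common way to get exactly-once *processing* on top of at-least-once *delivery* is to make the consumer idempotent. A minimal sketch, not tied to any particular broker; in production the set of processed ids would live in durable storage:

```python
processed_ids = set()   # in production: durable storage, e.g. a database table
results = []

def handle(message):
    """Apply a message's effect only if its id has not been seen before."""
    if message["id"] in processed_ids:
        return  # duplicate from a redelivery; safe to skip
    processed_ids.add(message["id"])
    results.append(message["payload"])

# At-least-once delivery may redeliver the same message after a retry:
for msg in [{"id": "a", "payload": 1},
            {"id": "b", "payload": 2},
            {"id": "a", "payload": 1}]:  # redelivery of "a"
    handle(msg)
# results == [1, 2]: each message's effect is applied exactly once
```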

5. Operational Considerations

  • Are there cost or resource constraints?
  • What kind of monitoring or alerting is expected?
  • Do we need to support schema evolution or backward compatibility?
    • Schema Evolution:
      • The ability of a system to adapt to changes in the data schema over time without breaking existing pipelines or applications.
    • Backward Compatibility:
      • A schema change is backward compatible if consumers using the new schema can still read data written with the old schema without errors (for example, adding an optional field with a default value).
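A backward-compatible change can be sketched as follows: a new optional field is added with a default, so a reader on the new schema still handles records written before the change. The field names and default are illustrative assumptions:

```python
# Illustrative default for a field added in the new schema version.
NEW_SCHEMA_DEFAULTS = {"currency": "USD"}

def read_record(raw):
    """Fill in defaults for fields the old writer did not emit."""
    record = dict(NEW_SCHEMA_DEFAULTS)
    record.update(raw)  # explicit values in the data win over defaults
    return record

old_record = {"id": 7, "amount": 19.99}                      # pre-change writer
new_record = {"id": 8, "amount": 5.00, "currency": "EUR"}    # post-change writer

assert read_record(old_record)["currency"] == "USD"  # default applied
assert read_record(new_record)["currency"] == "EUR"  # explicit value kept
```

Removing a required field or changing a field's type, by contrast, typically breaks compatibility in one direction or the other.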

6. Security and Governance

  • Are there access controls or privacy regulations we need to consider?
  • Is data lineage or auditing required?

These questions will help us clearly understand the requirements, enabling us to design a pipeline that is robust, scalable, maintainable, and reliable.
