Defining scope and constraints for Data Engineering pipelines

When designing a data pipeline, it is important to start by asking the following questions:

1. Requirements and Goals

  • What is the primary goal of this pipeline?
  • What types of data are we processing (structured, semi-structured, unstructured)?
    • Structured data: Data organized in a fixed schema, such as rows in a relational table.
    • Semi-structured: Data that doesn’t follow a rigid schema but still has some organizational markers, such as JSON or XML.
    • Unstructured: Data with no predefined structure that cannot easily fit into rows and columns, such as images or free text.
  • How often does the data need to be available for consumers? (real-time, near-real-time, daily batch)
  • Are there any SLAs for latency or throughput?
    • Latency: The time it takes for a single piece of data to travel through the system from input to output.
    • Throughput: The amount of data a system can process in a given time period.
  • How large is the expected data volume now, and how fast do we expect it to grow?
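The latency/throughput distinction above can be made concrete with a minimal sketch. The `process` function here is a stand-in for a real pipeline stage; the measurement approach, not the transformation, is the point:

```python
import time

def process(record):
    # Stand-in transformation; a real stage would do far more work.
    return record * 2

def run_batch(records):
    """Process a batch and report simple latency and throughput figures."""
    latencies = []
    start = time.perf_counter()
    for record in records:
        t0 = time.perf_counter()
        process(record)
        latencies.append(time.perf_counter() - t0)  # per-record latency
    elapsed = time.perf_counter() - start
    return {
        "avg_latency_s": sum(latencies) / len(latencies),
        "throughput_rps": len(records) / elapsed,  # records per second
    }

stats = run_batch(list(range(10_000)))
print(f"avg latency: {stats['avg_latency_s']:.2e}s, "
      f"throughput: {stats['throughput_rps']:.0f} records/s")
```

Note that optimizing for one metric can hurt the other: larger batches usually raise throughput but increase the latency of any individual record.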

2. Sources and Destinations

  • What are the data sources? (databases, APIs, logs, IoT devices, external systems)
  • Are there multiple sources with different reliability characteristics?
  • Where should the processed data be stored or consumed from? (data warehouse, lake, dashboards, ML models)
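Answers to these questions often end up in a pipeline configuration. A hypothetical sketch (the names, types, and reliability labels are illustrative, not tied to any particular tool):

```python
# Hypothetical pipeline configuration; all names and fields are illustrative.
pipeline_config = {
    "sources": [
        {"name": "orders_db", "type": "postgres", "reliability": "high"},
        {"name": "clickstream", "type": "kafka", "reliability": "best-effort"},
        {"name": "partner_api", "type": "rest", "reliability": "rate-limited"},
    ],
    "destinations": [
        {"name": "warehouse", "type": "bigquery"},
        {"name": "dashboard_cache", "type": "redis"},
    ],
}

# Surfacing reliability differences up front helps decide where retries,
# buffering, or dead-letter queues will be needed.
flaky_sources = [s["name"] for s in pipeline_config["sources"]
                 if s["reliability"] != "high"]
```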

3. Data Processing Expectations

  • Do we need complex transformations, aggregations, or enrichment?
  • Should the pipeline handle streaming, batch, or a mix of both?
  • Are there specific data quality requirements? (e.g., deduplication, schema validation)
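The two data quality checks named above, deduplication and schema validation, can be sketched as plain functions. The required field names are assumptions for illustration:

```python
def validate(record, required_fields=("id", "ts", "value")):
    """Schema validation: reject records missing required fields."""
    return all(field in record for field in required_fields)

def deduplicate(records, key="id"):
    """Deduplication: keep the first record seen for each key."""
    seen, out = set(), []
    for r in records:
        if r[key] not in seen:
            seen.add(r[key])
            out.append(r)
    return out

raw = [
    {"id": 1, "ts": 100, "value": 9.5},
    {"id": 1, "ts": 100, "value": 9.5},   # duplicate of the first record
    {"id": 2, "ts": 101},                 # missing "value": fails validation
    {"id": 3, "ts": 102, "value": 7.0},
]
clean = deduplicate([r for r in raw if validate(r)])
# clean retains only the records with ids 1 and 3
```

In production these checks usually run against a declared schema (and the dedup state lives in durable storage), but the shape of the logic is the same.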

4. Reliability and Fault Tolerance

  • What level of reliability is expected? (exactly-once, at-least-once, at-most-once)
    • At-most-once:
      • Each message is delivered zero or one time.
      • Messages might be lost, but never duplicated.
    • At-least-once:
      • Each message is delivered one or more times.
      • Messages are never lost, but duplicates can occur.
    • Exactly-once:
      • Each message is delivered exactly once.
      • No duplicates, no losses.
      • Achieved with careful transactional processing or idempotent operations.
  • Are there any retention requirements?
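A common way to get exactly-once *processing* on top of at-least-once *delivery* is to make the consumer idempotent. A minimal sketch, not tied to any particular broker; in production the set of processed ids would live in durable storage:

```python
processed_ids = set()   # in production: durable storage, e.g. a database table
results = []

def handle(message):
    """Apply a message's effect only if its id has not been seen before."""
    if message["id"] in processed_ids:
        return  # duplicate from a redelivery; safe to skip
    processed_ids.add(message["id"])
    results.append(message["payload"])

# At-least-once delivery may redeliver the same message after a retry:
for msg in [{"id": "a", "payload": 1},
            {"id": "b", "payload": 2},
            {"id": "a", "payload": 1}]:  # redelivery of "a"
    handle(msg)
# results == [1, 2]: each message's effect is applied exactly once
```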

5. Operational Considerations

  • Are there cost or resource constraints?
  • What kind of monitoring or alerting is expected?
  • Do we need to support schema evolution or backward compatibility?
    • Schema Evolution:
      • The ability of a system to adapt to changes in the data schema over time without breaking existing pipelines or applications.
    • Backward Compatibility:
      • A schema change is backward compatible if consumers using the new schema can still read data written with the old schema without errors (for example, adding an optional field with a default value).
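A backward-compatible change can be sketched as follows: a new optional field is added with a default, so a reader on the new schema still handles records written before the change. The field names and default are illustrative assumptions:

```python
# Illustrative default for a field added in the new schema version.
NEW_SCHEMA_DEFAULTS = {"currency": "USD"}

def read_record(raw):
    """Fill in defaults for fields the old writer did not emit."""
    record = dict(NEW_SCHEMA_DEFAULTS)
    record.update(raw)  # explicit values in the data win over defaults
    return record

old_record = {"id": 7, "amount": 19.99}                      # pre-change writer
new_record = {"id": 8, "amount": 5.00, "currency": "EUR"}    # post-change writer

assert read_record(old_record)["currency"] == "USD"  # default applied
assert read_record(new_record)["currency"] == "EUR"  # explicit value kept
```

Removing a required field or changing a field's type, by contrast, typically breaks compatibility in one direction or the other.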

6. Security and Governance

  • Are there access controls or privacy regulations we need to consider?
  • Is data lineage or auditing required?

These questions will help us clearly understand the requirements, enabling us to design a pipeline that is robust, scalable, maintainable, and reliable.
