When designing a data pipeline, it is important to start by asking the following questions:
1. Requirements and Goals
- What is the primary goal of this pipeline?
- What types of data are we processing (structured, semi-structured, unstructured)?
- Structured: Data organized according to a fixed schema, such as rows in a relational table
- Semi-structured: Data that doesn’t follow a rigid schema but still has some organizational markers, such as JSON or XML
- Unstructured: Data that has no predefined structure and cannot easily be fit into rows and columns, such as free text or images.
- How often does the data need to be available for consumers? (real-time, near-real-time, daily batch)
- Are there any SLAs for latency or throughput?
- Latency: The time it takes for a single piece of data to travel through the system from input to output.
- Throughput: The amount of data a system can process in a given time period.
- How large is the expected data volume now, and how fast do we expect it to grow?
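The latency and throughput definitions above can be made concrete with a small timing sketch (the `process` stage is a hypothetical stand-in for one pipeline step):

```python
import time

def process(record):
    """Stand-in for one pipeline stage (hypothetical work)."""
    return record.upper()

records = [f"event-{i}" for i in range(1000)]

start = time.perf_counter()
latencies = []
for r in records:
    t0 = time.perf_counter()
    process(r)
    latencies.append(time.perf_counter() - t0)  # per-record latency
elapsed = time.perf_counter() - start

avg_latency = sum(latencies) / len(latencies)   # seconds per record
throughput = len(records) / elapsed             # records per second
print(f"avg latency: {avg_latency:.6f}s, throughput: {throughput:.0f} rec/s")
```

Note that the two metrics can move independently: batching often raises throughput while also raising per-record latency.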
2. Sources and Destinations
- What are the data sources? (databases, APIs, logs, IoT devices, external systems)
- Are there multiple sources with different reliability characteristics?
- Where should the processed data be stored or consumed from? (data warehouse, lake, dashboards, ML models)
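One lightweight way to reason about sources with different reliability characteristics is to inventory them explicitly; a minimal sketch with hypothetical source names:

```python
from dataclasses import dataclass

@dataclass
class Source:
    name: str
    kind: str       # e.g. "database", "api", "log", "iot"
    reliable: bool  # does it guarantee delivery and ordering?

# Hypothetical inventory of sources with different reliability characteristics
sources = [
    Source("orders_db", "database", reliable=True),
    Source("clickstream", "log", reliable=False),  # may drop or duplicate events
]

# Sources flagged here will need retries/dedup handling downstream
unreliable = [s.name for s in sources if not s.reliable]
print(unreliable)
```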
3. Data Processing Expectations
- Do we need complex transformations, aggregations, or enrichment?
- Should the pipeline handle streaming, batch, or a mix of both?
- Are there specific data quality requirements? (e.g., deduplication, schema validation)
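The data-quality checks mentioned above, schema validation and deduplication, might look like this in minimal form (the required fields are hypothetical):

```python
# Minimal data-quality pass: schema validation plus deduplication by key.
REQUIRED_FIELDS = {"id", "ts", "value"}  # hypothetical schema

def validate(record: dict) -> bool:
    """Keep only records that carry every required field."""
    return REQUIRED_FIELDS <= record.keys()

def dedupe(records):
    """Yield the first record seen for each id, drop later duplicates."""
    seen = set()
    for r in records:
        if r["id"] not in seen:
            seen.add(r["id"])
            yield r

raw = [
    {"id": 1, "ts": "2024-01-01", "value": 10},
    {"id": 1, "ts": "2024-01-01", "value": 10},  # duplicate
    {"id": 2, "ts": "2024-01-01"},               # missing "value"
]
clean = list(dedupe(r for r in raw if validate(r)))
print(clean)  # one valid, deduplicated record survives
```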
4. Reliability and Fault Tolerance
- What level of reliability is expected? (exactly-once, at-least-once, at-most-once)
- At-most-once:
- Each message is delivered zero or one time.
- Messages might be lost, but never duplicated.
- At-least-once:
- Each message is delivered one or more times.
- Messages are never lost, but duplicates can occur.
- Exactly-once:
- Each message is delivered exactly once.
- No duplicates, no losses.
- Achieved with careful transactional processing or idempotent operations.
- Are there any retention requirements?
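As a sketch of how exactly-once processing can be layered on top of at-least-once delivery via idempotent operations (message ids and amounts are made up):

```python
# Turning at-least-once delivery into exactly-once *processing*
# by making the consumer idempotent: dedup by message id.
processed_ids = set()  # in production this would live in durable storage
balance = 0

def handle(message):
    """Apply a deposit at most once per message id."""
    global balance
    if message["id"] in processed_ids:
        return  # duplicate redelivery: ignore
    processed_ids.add(message["id"])
    balance += message["amount"]

# An at-least-once broker may redeliver message "m1"
for msg in [{"id": "m1", "amount": 50},
            {"id": "m1", "amount": 50},   # redelivered duplicate
            {"id": "m2", "amount": 25}]:
    handle(msg)

print(balance)  # 75, not 125: the duplicate had no effect
```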
5. Operational Considerations
- Are there cost or resource constraints?
- What kind of monitoring or alerting is expected?
- Do we need to support schema evolution or backward compatibility?
- Schema Evolution:
- The ability of a system to adapt to changes in the data schema over time without breaking existing pipelines or applications.
- Backward Compatibility:
- A schema change is backward compatible if old consumers can still read new data without errors.
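A minimal illustration of a backward-compatible change, assuming a hypothetical v1 schema with `id` and `amount` to which the producer adds an optional `currency` field:

```python
import json

# v1 schema: {"id", "amount"}; v2 adds an optional "currency" field.
new_producer_record = json.dumps({"id": "42", "amount": 19.99, "currency": "USD"})

def old_consumer(payload: str) -> float:
    """A v1 consumer: reads only the fields it knows about."""
    record = json.loads(payload)
    return record["amount"]  # unknown "currency" field is simply ignored

print(old_consumer(new_producer_record))  # old consumer still works on new data
```

Adding an optional field is backward compatible; removing or renaming a field the old consumer reads would not be.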
6. Security and Governance
- Are there access controls or privacy regulations we need to consider?
- Is data lineage or auditing required?
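If lineage or auditing is required, a minimal sketch of an append-only lineage record (dataset and step names are hypothetical; a real system would write to a durable store):

```python
from datetime import datetime, timezone

audit_log = []  # in practice: an append-only, durable audit store

def record_lineage(output_id, input_ids, step):
    """Append a lineage/audit entry linking an output to its inputs."""
    audit_log.append({
        "output": output_id,
        "inputs": input_ids,
        "step": step,
        "at": datetime.now(timezone.utc).isoformat(),
    })

record_lineage("report_2024_01", ["orders_raw", "users_raw"], step="daily_join")
print(audit_log[0]["inputs"])  # which inputs produced this output
```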
Answering these questions up front gives us a clear picture of the requirements, enabling us to design a pipeline that is robust, scalable, maintainable, and reliable.