Batch ingestion is the process of collecting and transferring large volumes of data at scheduled intervals (e.g., hourly, daily) rather than continuously in real time.
Key points:
- Data is ingested periodically.
- Processing happens on entire datasets rather than individual events.
- Often used for analytics, reporting, and ETL pipelines.
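
The key points above can be sketched in a few lines of Python. This is only an illustration, assuming in-memory `(timestamp, value)` events and an hourly interval; `hourly_batches` and `process_batch` are hypothetical helper names, not part of any tool listed here.

```python
from collections import defaultdict
from datetime import datetime


def hourly_batches(events):
    """Group (timestamp, value) events by the hour in which they occurred."""
    batches = defaultdict(list)
    for ts, value in events:
        # Truncate the timestamp to the start of its hour to get the batch window.
        window_start = ts.replace(minute=0, second=0, microsecond=0)
        batches[window_start].append(value)
    return dict(batches)


def process_batch(values):
    """Process an entire batch at once (here, a simple aggregate)."""
    return {"count": len(values), "total": sum(values)}


# Hypothetical sample events: two in the 09:00 window, one in the 10:00 window.
events = [
    (datetime(2024, 1, 1, 9, 5), 10),
    (datetime(2024, 1, 1, 9, 40), 5),
    (datetime(2024, 1, 1, 10, 15), 7),
]

results = {
    start: process_batch(values)
    for start, values in hourly_batches(events).items()
}
```

Each window is handled as a complete dataset, which is the main contrast with event-at-a-time stream processing.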

| Category | Tool | Notes / Features |
|---|---|---|
| ETL Platforms | Apache NiFi | Can schedule flows, ingest large datasets from DBs, APIs, or files. |
| | Informatica PowerCenter | Enterprise ETL tool for batch pipelines with GUI and transformations. |
| | Talend | Open-source and cloud ETL tool for batch and some streaming. |
| Big Data ETL / Processing | Apache Spark | Batch processing engine; reads from HDFS, S3, or databases. |
| | Apache Hive | SQL-like queries over batch datasets stored in HDFS or cloud storage. |
| | Presto / Trino | Distributed SQL query engines; used to ingest/analyze batch datasets. |
| Cloud ETL Services | AWS Glue | Serverless ETL for batch pipelines; integrates with S3, Redshift, RDS. |
| | Google Dataflow | Can run batch pipelines; supports Apache Beam SDK. |
| | Azure Data Factory | Orchestrates batch ETL from multiple sources to sinks. |
| Data Movement / Orchestration | Sqoop | For batch ingestion from relational DBs into Hadoop / data lake. |
| | Airbyte / Fivetran | Cloud connectors for batch extraction from SaaS apps and DBs. |
| | Apache Oozie / Airflow | Workflow orchestrators that schedule batch ingestion and processing. |

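
To make the orchestration row more concrete, here is a minimal sketch of what a workflow orchestrator does at its core: run tasks in dependency order. It uses Python's standard-library `graphlib` and is not the Airflow or Oozie API; `run_pipeline` and the task names are hypothetical. Real orchestrators add scheduling, retries, and monitoring on top of this idea.

```python
from graphlib import TopologicalSorter


def run_pipeline(tasks, dependencies):
    """Run each task after all of its upstream tasks; return the run order."""
    # `dependencies` maps a task name to the set of tasks it depends on.
    order = list(TopologicalSorter(dependencies).static_order())
    for name in order:
        tasks[name]()
    return order


log = []
# Hypothetical tasks that only record that they ran.
tasks = {
    "extract": lambda: log.append("extract"),
    "load": lambda: log.append("load"),
    "transform": lambda: log.append("transform"),
}
dependencies = {"load": {"extract"}, "transform": {"load"}}

order = run_pipeline(tasks, dependencies)
```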
Typical Batch Ingestion Architecture

Source Systems (DB, SaaS, APIs) → Fivetran / ETL tool (Extract + Load) → Data Warehouse / Data Lake → dbt (Transform: clean, model, aggregate) → Analytics / BI / ML
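
The extract-and-load-then-transform flow above can be sketched with an in-memory SQLite database standing in for the warehouse. This is an assumption-laden illustration: the table names `raw_orders` and `customer_totals`, the sample rows, and the helper functions are all hypothetical, and the SQL step only imitates the kind of model a tool like dbt would manage.

```python
import sqlite3


def extract(source_rows):
    """Stand-in for a connector pulling a full batch from a source system."""
    return list(source_rows)


def load(conn, rows):
    """Load the raw batch unchanged into a staging table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS raw_orders "
        "(order_id INTEGER, customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", rows)
    conn.commit()


def transform(conn):
    """Build a cleaned, aggregated model inside the warehouse using SQL."""
    conn.execute("DROP TABLE IF EXISTS customer_totals")
    conn.execute(
        """
        CREATE TABLE customer_totals AS
        SELECT lower(trim(customer)) AS customer, SUM(amount) AS total_amount
        FROM raw_orders
        WHERE amount IS NOT NULL
        GROUP BY lower(trim(customer))
        """
    )
    conn.commit()


# Hypothetical source batch with inconsistent names and a missing amount.
source = [(1, "Alice ", 20.0), (2, "alice", 5.0), (3, "Bob", None), (4, "Bob", 12.5)]

conn = sqlite3.connect(":memory:")
load(conn, extract(source))
transform(conn)
totals = conn.execute(
    "SELECT customer, total_amount FROM customer_totals ORDER BY customer"
).fetchall()
```

Keeping the raw table untouched and rebuilding the model from it means the transform can be rerun or changed without re-extracting from the source.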

Cloud Storage / SaaS → Batch Ingestion (Fivetran, Glue, Dataflow, Data Factory) → Data Lake (S3 / GCS / ADLS) → Batch Processing (Spark, Glue, Dataflow) → Data Warehouse (Redshift, BigQuery, Snowflake) → Analytics / Dashboards / ML

Source Systems (DB, SaaS, Logs) → Batch Extract (NiFi, Sqoop, Fivetran) → Landing Zone / Raw Data Lake (S3, HDFS, GCS) → Transformation / Enrichment (Spark, Glue) → Curated Zone / Analytics DB → BI / Machine Learning
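
The landing-zone flow above can be sketched using a temporary local directory in place of S3, HDFS, or GCS. The `raw/` and `curated/` directory layout, the `dt=` partition naming, and the helper names `land_raw` and `curate` are assumptions chosen for illustration, not a layout required by any of the tools named.

```python
import json
import tempfile
from pathlib import Path


def land_raw(lake_root, run_date, records):
    """Write the batch unchanged to a date-partitioned path in the raw zone."""
    partition = Path(lake_root) / "raw" / "orders" / f"dt={run_date}"
    partition.mkdir(parents=True, exist_ok=True)
    path = partition / "part-0000.jsonl"
    with path.open("w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")
    return path


def curate(lake_root, run_date):
    """Read one raw partition, drop invalid rows, and write the curated result."""
    raw_path = Path(lake_root) / "raw" / "orders" / f"dt={run_date}" / "part-0000.jsonl"
    with raw_path.open(encoding="utf-8") as f:
        rows = [json.loads(line) for line in f]
    # Keep only rows with an amount, and cast the amount to a number.
    cleaned = [
        {"order_id": r["order_id"], "amount": float(r["amount"])}
        for r in rows
        if r.get("amount") is not None
    ]
    out_dir = Path(lake_root) / "curated" / "orders" / f"dt={run_date}"
    out_dir.mkdir(parents=True, exist_ok=True)
    (out_dir / "part-0000.json").write_text(json.dumps(cleaned), encoding="utf-8")
    return cleaned


with tempfile.TemporaryDirectory() as lake_root:
    # Hypothetical batch: one valid record and one with a missing amount.
    batch = [{"order_id": 1, "amount": "19.99"}, {"order_id": 2, "amount": None}]
    raw_file = land_raw(lake_root, "2024-01-01", batch)
    raw_exists = raw_file.exists()
    curated = curate(lake_root, "2024-01-01")
```

Because each run writes to its own date partition and overwrites that partition's files, rerunning a failed batch for the same date does not duplicate data.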