Batch Ingestion

Batch ingestion is the process of collecting and transferring large volumes of data at scheduled intervals (e.g., hourly, daily) rather than in real time.

Key points:

  • Data is ingested periodically.
  • Processing happens on entire datasets rather than individual events.
  • Often used for analytics, reporting, and ETL pipelines.
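The key points above can be sketched as a minimal batch job in Python: each run extracts only the rows created since the last run's watermark and writes them out as one batch file. The `events` table, its columns, and the CSV output path are illustrative assumptions, not part of any specific tool.

```python
import csv
import sqlite3

def run_batch(conn, watermark, out_path):
    """Extract all rows created since the last watermark and write one batch file."""
    rows = conn.execute(
        "SELECT id, payload, created_at FROM events "
        "WHERE created_at > ? ORDER BY created_at",
        (watermark,),
    ).fetchall()
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "payload", "created_at"])
        writer.writerows(rows)
    # New watermark = latest timestamp seen; unchanged if the batch was empty.
    return rows[-1][2] if rows else watermark

# Demo: seed a source table (SQLite stands in for the source DB),
# then run one hourly-style batch.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER, payload TEXT, created_at TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [(1, "a", "2024-01-01T00:10:00"), (2, "b", "2024-01-01T00:50:00")],
)
new_wm = run_batch(conn, "2024-01-01T00:00:00", "batch_0.csv")
print(new_wm)  # 2024-01-01T00:50:00
```

A real scheduler (cron, Airflow) would invoke `run_batch` on the chosen interval and persist the watermark between runs.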
| Category | Tool | Notes / Features |
| --- | --- | --- |
| ETL Platforms | Apache NiFi | Can schedule flows; ingests large datasets from DBs, APIs, or files. |
| ETL Platforms | Informatica PowerCenter | Enterprise ETL tool for batch pipelines with a GUI and transformations. |
| ETL Platforms | Talend | Open-source and cloud ETL tool for batch and some streaming. |
| Big Data ETL / Processing | Apache Spark | Batch processing engine; reads from HDFS, S3, or databases. |
| Big Data ETL / Processing | Apache Hive | SQL-like queries over batch datasets stored in HDFS or cloud storage. |
| Big Data ETL / Processing | Presto / Trino | Distributed SQL query engines; used to ingest/analyze batch datasets. |
| Cloud ETL Services | AWS Glue | Serverless ETL for batch pipelines; integrates with S3, Redshift, RDS. |
| Cloud ETL Services | Google Dataflow | Can run batch pipelines; supports the Apache Beam SDK. |
| Cloud ETL Services | Azure Data Factory | Orchestrates batch ETL from multiple sources to sinks. |
| Data Movement / Orchestration | Sqoop | Batch ingestion from relational DBs into Hadoop / a data lake. |
| Data Movement / Orchestration | Airbyte / Fivetran | Cloud connectors for batch extraction from SaaS apps and DBs. |
| Data Movement / Orchestration | Apache Oozie / Airflow | Workflow orchestrators that schedule batch ingestion and processing. |

Typical Batch Ingestion Architectures

Three common patterns are shown below.

ELT pattern (extract and load first, transform inside the warehouse):

    Source Systems (DB, SaaS, APIs)
            │  Extract + Load (Fivetran / ETL tool)
            ▼
    Data Warehouse / Data Lake
            │  Transform (dbt): clean, model, aggregate
            ▼
    Analytics / BI / ML
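The Extract + Load step followed by an in-warehouse transform can be sketched with SQLite standing in for the warehouse. The `raw_orders` table and the aggregation are illustrative assumptions; in practice the transform would be a dbt model compiled to warehouse SQL.

```python
import sqlite3

# SQLite stands in for the warehouse (Snowflake / BigQuery / Redshift).
wh = sqlite3.connect(":memory:")

# Load step: land raw extracted rows as-is (the E and L of ELT).
wh.execute("CREATE TABLE raw_orders (order_id INTEGER, amount_cents INTEGER, status TEXT)")
wh.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [(1, 1000, "paid"), (2, 250, "refunded"), (3, 4000, "paid")],
)

# Transform step: model and aggregate inside the warehouse (the dbt role).
wh.execute("""
    CREATE TABLE paid_revenue AS
    SELECT COUNT(*) AS orders, SUM(amount_cents) AS revenue_cents
    FROM raw_orders
    WHERE status = 'paid'
""")
print(wh.execute("SELECT orders, revenue_cents FROM paid_revenue").fetchone())  # (2, 5000)
```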
Cloud-native pattern (data lake plus warehouse):

    Cloud Storage / SaaS
            │  Batch Ingestion (Fivetran, Glue, Dataflow, Data Factory)
            ▼
    Data Lake (S3 / GCS / ADLS)
            │  Batch Processing (Spark, Glue, Dataflow)
            ▼
    Data Warehouse (Redshift, BigQuery, Snowflake)
            │
            ▼
    Analytics / Dashboards / ML
Zoned data lake pattern (raw landing zone to curated zone):

    Source Systems (DB, SaaS, Logs)
            │  Batch Extract (NiFi, Sqoop, Fivetran)
            ▼
    Landing Zone / Raw Data Lake (S3, HDFS, GCS)
            │  Transformation / Enrichment (Spark, Glue)
            ▼
    Curated Zone / Analytics DB
            │
            ▼
    BI / Machine Learning
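The stage-by-stage flow above is what orchestrators such as Airflow or Oozie schedule. A toy stand-in makes the dependency order explicit; the stage functions and their data are illustrative assumptions, not any real connector's API.

```python
# Toy stand-in for an orchestrated pipeline: run the stages of the
# diagram above in dependency order, each stage feeding the next.

def extract():          # Source Systems -> Landing Zone
    return [{"user": "ada", "clicks": 3}, {"user": "bob", "clicks": 7}]

def transform(raw):     # Landing Zone -> Curated Zone (enrichment)
    return [{**r, "bucket": "high" if r["clicks"] >= 5 else "low"} for r in raw]

def load(curated):      # Curated Zone -> Analytics DB (here: a dict)
    return {r["user"]: r["bucket"] for r in curated}

stages = [extract, transform, load]   # a linear DAG
result = None
for stage in stages:
    result = stage(result) if result is not None else stage()
print(result)  # {'ada': 'low', 'bob': 'high'}
```

A real orchestrator adds what this sketch omits: scheduling, retries, backfills, and per-task monitoring.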