Batch Ingestion

Batch ingestion is the process of collecting and transferring large volumes of data at scheduled intervals (e.g., hourly, daily) rather than in real time.

Key points:

  • Data is ingested periodically.
  • Processing happens on entire datasets rather than individual events.
  • Often used for analytics, reporting, and ETL pipelines.
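The key points above can be sketched as a minimal batch job in Python: each run extracts only the rows created since the last run's watermark and writes them out as one batch file. The `events` table, its columns, and the CSV output path are illustrative assumptions, not part of any specific tool.

```python
import csv
import sqlite3

def run_batch(conn, watermark, out_path):
    """Extract all rows created since the last watermark and write one batch file."""
    rows = conn.execute(
        "SELECT id, payload, created_at FROM events "
        "WHERE created_at > ? ORDER BY created_at",
        (watermark,),
    ).fetchall()
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "payload", "created_at"])
        writer.writerows(rows)
    # New watermark = latest timestamp seen; unchanged if the batch was empty.
    return rows[-1][2] if rows else watermark

# Demo: seed a source table (SQLite stands in for the source DB),
# then run one hourly-style batch.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER, payload TEXT, created_at TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [(1, "a", "2024-01-01T00:10:00"), (2, "b", "2024-01-01T00:50:00")],
)
new_wm = run_batch(conn, "2024-01-01T00:00:00", "batch_0.csv")
print(new_wm)  # 2024-01-01T00:50:00
```

A real scheduler (cron, Airflow) would invoke `run_batch` on the chosen interval and persist the watermark between runs.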
| Category | Tool | Notes / Features |
| --- | --- | --- |
| ETL Platforms | Apache NiFi | Can schedule flows; ingests large datasets from DBs, APIs, or files. |
| ETL Platforms | Informatica PowerCenter | Enterprise ETL tool for batch pipelines with a GUI and transformations. |
| ETL Platforms | Talend | Open-source and cloud ETL tool for batch and some streaming. |
| Big Data ETL / Processing | Apache Spark | Batch processing engine; reads from HDFS, S3, or databases. |
| Big Data ETL / Processing | Apache Hive | SQL-like queries over batch datasets stored in HDFS or cloud storage. |
| Big Data ETL / Processing | Presto / Trino | Distributed SQL query engines; used to ingest/analyze batch datasets. |
| Cloud ETL Services | AWS Glue | Serverless ETL for batch pipelines; integrates with S3, Redshift, RDS. |
| Cloud ETL Services | Google Dataflow | Can run batch pipelines; supports the Apache Beam SDK. |
| Cloud ETL Services | Azure Data Factory | Orchestrates batch ETL from multiple sources to sinks. |
| Data Movement / Orchestration | Sqoop | Batch ingestion from relational DBs into Hadoop / a data lake. |
| Data Movement / Orchestration | Airbyte / Fivetran | Cloud connectors for batch extraction from SaaS apps and DBs. |
| Data Movement / Orchestration | Apache Oozie / Airflow | Workflow orchestrators that schedule batch ingestion and processing. |

Typical Batch Ingestion Architectures

Three common patterns are shown below.

ELT pattern (extract and load first, transform inside the warehouse):

    Source Systems (DB, SaaS, APIs)
            │  Extract + Load (Fivetran / ETL tool)
            ▼
    Data Warehouse / Data Lake
            │  Transform (dbt): clean, model, aggregate
            ▼
    Analytics / BI / ML
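The Extract + Load step followed by an in-warehouse transform can be sketched with SQLite standing in for the warehouse. The `raw_orders` table and the aggregation are illustrative assumptions; in practice the transform would be a dbt model compiled to warehouse SQL.

```python
import sqlite3

# SQLite stands in for the warehouse (Snowflake / BigQuery / Redshift).
wh = sqlite3.connect(":memory:")

# Load step: land raw extracted rows as-is (the E and L of ELT).
wh.execute("CREATE TABLE raw_orders (order_id INTEGER, amount_cents INTEGER, status TEXT)")
wh.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [(1, 1000, "paid"), (2, 250, "refunded"), (3, 4000, "paid")],
)

# Transform step: model and aggregate inside the warehouse (the dbt role).
wh.execute("""
    CREATE TABLE paid_revenue AS
    SELECT COUNT(*) AS orders, SUM(amount_cents) AS revenue_cents
    FROM raw_orders
    WHERE status = 'paid'
""")
print(wh.execute("SELECT orders, revenue_cents FROM paid_revenue").fetchone())  # (2, 5000)
```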
Cloud-native pattern (data lake plus warehouse):

    Cloud Storage / SaaS
            │  Batch Ingestion (Fivetran, Glue, Dataflow, Data Factory)
            ▼
    Data Lake (S3 / GCS / ADLS)
            │  Batch Processing (Spark, Glue, Dataflow)
            ▼
    Data Warehouse (Redshift, BigQuery, Snowflake)
            │
            ▼
    Analytics / Dashboards / ML
Zoned data lake pattern (raw landing zone to curated zone):

    Source Systems (DB, SaaS, Logs)
            │  Batch Extract (NiFi, Sqoop, Fivetran)
            ▼
    Landing Zone / Raw Data Lake (S3, HDFS, GCS)
            │  Transformation / Enrichment (Spark, Glue)
            ▼
    Curated Zone / Analytics DB
            │
            ▼
    BI / Machine Learning
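The stage-by-stage flow above is what orchestrators such as Airflow or Oozie schedule. A toy stand-in makes the dependency order explicit; the stage functions and their data are illustrative assumptions, not any real connector's API.

```python
# Toy stand-in for an orchestrated pipeline: run the stages of the
# diagram above in dependency order, each stage feeding the next.

def extract():          # Source Systems -> Landing Zone
    return [{"user": "ada", "clicks": 3}, {"user": "bob", "clicks": 7}]

def transform(raw):     # Landing Zone -> Curated Zone (enrichment)
    return [{**r, "bucket": "high" if r["clicks"] >= 5 else "low"} for r in raw]

def load(curated):      # Curated Zone -> Analytics DB (here: a dict)
    return {r["user"]: r["bucket"] for r in curated}

stages = [extract, transform, load]   # a linear DAG
result = None
for stage in stages:
    result = stage(result) if result is not None else stage()
print(result)  # {'ada': 'low', 'bob': 'high'}
```

A real orchestrator adds what this sketch omits: scheduling, retries, backfills, and per-task monitoring.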