Apache Kafka

  • What is Kafka?
    • It is an open-source, distributed, event-streaming platform
    • Designed to:
      • handle large quantities of data
      • be scalable to a variety of situations
  • What is event streaming?
    • Obtaining information from various sources
    • Reliably storing that data
    • Providing it to any users or systems needing access
  • Kafka Components:
    • Topics:
      • A common message type stored within Kafka
      • A "notepad" where Kafka stores the events or messages it receives
      • There can be any number of topics within a Kafka system
    • Producer
      • Writes events to various topics
      • A producer can write to a single or multiple topics
    • Consumer
      • Reads the information from topics

Kafka Producers:

  • What is a producer?
    • Also called the publisher
    • It is the component that writes messages to Kafka topics
      • Messages sent via producers are stored on Kafka for later use
      • The goal is to retain the data written to Kafka for as long as possible
    • Many producers can communicate with Kafka at the same time
    • Each producer can write to a single topic or to multiple topics, as required
  • Types:
    • Command-line producer:
      • Included in the Kafka software installation
        • kafka-console-producer.sh
          • Found in the /bin folder
            • --bootstrap-server – Specifies which server to use
              • --bootstrap-server localhost:9092
            • --topic – Specifies the topic to write to
              • --topic phishing-sites
              • This opens a connection where you can type messages
    • Example:

Interactive:

bin/kafka-console-producer.sh --bootstrap-server localhost:9092 --topic testing
>This is the first message
>This is the second message
>This is the third message
<ctrl+c to exit>

Non-interactive:

echo "This is the first message" | bin/kafka-console-producer.sh --bootstrap-server localhost:9092 --topic testing
    • Python
    • Java
    • Many other languages
    • Kafka Connect
      • Designed to interact with existing systems:
        • Relational databases
      • Automatically transfers data between those systems and Kafka
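The Python option above can be sketched with the third-party kafka-python package (an assumption; confluent-kafka is another common client). The broker address and topic name are placeholders matching the console examples:

```python
import json

def encode_event(event: dict) -> bytes:
    # Kafka messages are byte arrays; JSON is a common serialization choice
    return json.dumps(event).encode("utf-8")

def main():
    # Assumes the kafka-python package is installed and a broker runs at localhost:9092
    from kafka import KafkaProducer

    producer = KafkaProducer(bootstrap_servers="localhost:9092")
    producer.send("testing", encode_event({"text": "This is the first message"}))
    producer.flush()  # block until all buffered messages are actually sent

if __name__ == "__main__":
    main()
```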

Kafka Consumers

  • A consumer is the reader portion of the Kafka platform
  • It is also called the subscriber
  • Consumers can read messages stored within Kafka
  • There can be many consumers communicating with a Kafka system at a time
  • A consumer can read from a single topic or from multiple topics at a time
  • Types:
    • Command-line
      • kafka-console-consumer.sh
        • Found in the /bin folder
        • --bootstrap-server – Specifies which server to use
          • --bootstrap-server localhost:9092
        • --topic – Specifies the topic to read from
          • --topic phishing-sites
          • This opens a connection where you can read messages
        • --from-beginning
          • Tells the Kafka system to send all messages in the topic, even ones that may have already been read
        • --max-messages
          • Used to read up to a maximum number of messages before exiting

Interactive:

bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic testing --from-beginning --max-messages 2
This is the first message
This is the second message
<automatically exits after 2 messages>
    • Python
    • Java
    • Many other languages
    • Kafka Connect
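The consumer side has the same shape in Python. This is a minimal sketch, again assuming the kafka-python package and a local broker; `auto_offset_reset="earliest"` plays roughly the role of `--from-beginning`:

```python
import json

def decode_event(raw: bytes) -> dict:
    # Reverse of the producer-side serialization: bytes -> JSON -> dict
    return json.loads(raw.decode("utf-8"))

def main():
    # Assumes the kafka-python package is installed and a broker runs at localhost:9092
    from kafka import KafkaConsumer

    consumer = KafkaConsumer(
        "testing",
        bootstrap_servers="localhost:9092",
        auto_offset_reset="earliest",  # start from the oldest retained message
    )
    for record in consumer:
        print(decode_event(record.value))

if __name__ == "__main__":
    main()
```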

Kafka Architecture

Components

Kafka Servers
  • A cluster of one or more computers
  • Store data
  • Manage communication with Kafka clients
  • Handle integration with other systems
    • Databases
    • Log files

Kafka Clients
  • Read data via Kafka consumers
  • Write data via Kafka producers
  • Process data locally and then store that information elsewhere or back to Kafka itself

Kafka Server

  • Primary task: Act as a Kafka broker
    • Handle communication between consumers and producers
  • Handles data storage
    • Data written by producers is stored and organized via topics
        • Topics are partitioned – they are stored in separate pieces
          • The individual messages are stored in a given partition based on an event key – e.g. customer_id, location_id
          • Within a partition, each message is retrieved in the order it was written to the topic
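The key-based placement described above can be sketched as follows. Note this is an illustration only: Kafka's default partitioner uses a murmur2 hash of the message key, while the stand-in hash below just demonstrates the "same key, same partition" idea:

```python
def hash_bytes(data: bytes) -> int:
    # Simple deterministic stand-in for Kafka's murmur2 key hash
    h = 0
    for b in data:
        h = (h * 31 + b) % (2**32)
    return h

def choose_partition(key: str, num_partitions: int) -> int:
    # Hash the key, then take it modulo the partition count, so every
    # message with the same key lands in the same partition
    return hash_bytes(key.encode("utf-8")) % num_partitions

# Messages with the same customer_id always map to the same partition,
# which is what preserves per-key ordering
p1 = choose_partition("customer_42", 3)
p2 = choose_partition("customer_42", 3)
```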
Partition & Replication – Kafka Servers
  • Kafka is fault-tolerant
    • If one system goes offline, others can provide the requested data
  • For each Kafka cluster, there is a replication factor
    • Tolerated failures = replication factor - 1
    • Max replication factor == number of servers
  • Copying partitions is how replication is handled
  • Example: a Kafka cluster with 3 brokers
  • 3 topics
  • 2x replication
    • There are 2 copies of each topic
  • Each topic is stored across 2 brokers within the cluster
  • If Broker 2 fails
    • Replication factor is 2
    • 2 - 1 == 1, so the cluster can handle 1 failed broker
    • Each topic still has at least one copy on the cluster
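The fault-tolerance arithmetic above can be written out directly:

```python
def tolerated_failures(replication_factor: int) -> int:
    # With N copies of each partition, N - 1 brokers can fail before data is lost
    return replication_factor - 1

def max_replication_factor(num_brokers: int) -> int:
    # A cluster cannot hold more copies of a partition than it has brokers
    return num_brokers

# The 3-broker, 2x-replication example from the notes:
assert tolerated_failures(2) == 1       # one broker may fail
assert max_replication_factor(3) == 3   # replication factor cannot exceed 3
```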
Kafka Clusters
  • ZooKeeper
    • A software framework for
      • managing information
      • providing services necessary for running distributed systems
    • Used to create distributed applications
      • Users rely on ZooKeeper to manage those distributed applications
    • What does it do?
      • Handles configuration information
      • Names systems to prevent conflicts
      • Provides the ability to synchronize across systems
        • Which systems are available
        • When they should start
        • How services can reach them
      • Provides services for a group of systems to communicate
    • Since it is a framework, you are not required to implement custom versions of these services
      • This allows for easier configuration, implementation, and interaction
  • Kafka uses Zookeeper for cluster management
    • Newer versions of Kafka can use KRaft instead of ZooKeeper
    • Two files used for server/cluster setup:
      • config/zookeeper.properties
        • Handles the information needed for a basic ZooKeeper setup, including where to store ZooKeeper data and which network port to run on
      • config/server.properties
        • This is the primary Kafka server configuration
        • Define information specific to Kafka installations
          • Details about:
            • Kafka brokers
            • Network configuration
            • Where to store events
            • Basic topic configuration
              • Replication details
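A minimal server.properties fragment covering those details might look like the following; the values are illustrative placeholders, not recommendations:

```properties
# Kafka broker identity
broker.id=0

# Network configuration
listeners=PLAINTEXT://localhost:9092

# Where to store events (log segments)
log.dirs=/tmp/kafka-logs

# Basic topic configuration, including replication details
num.partitions=1
default.replication.factor=1

# ZooKeeper connection for cluster management
zookeeper.connect=localhost:2181
```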
  • How to start a Kafka cluster?
    • Kafka servers are started in two steps:
      • Start up the basic zookeeper server and configuration details
        • bin/zookeeper-server-start.sh config/zookeeper.properties
      • Start up the actual Kafka services as defined in the server.properties file
        • bin/kafka-server-start.sh config/server.properties

  • How to stop a kafka cluster?
    • Shutdown happens in the reverse order of startup: Kafka is stopped first so it can cleanly shut down any open connections, and because it depends on ZooKeeper services.
      • bin/kafka-server-stop.sh
      • bin/zookeeper-server-stop.sh

Kafka Topics

  • What is a topic?
    • Logical grouping of events where each event or message is referring to the same type of thing
      • E.g. orders, downloads, virus detections, etc.
    • A topic is similar to a table in a relational database, where each event has a similar structure and a meaning
    • Also referred to as an event log
      • Messages written to the log are immutable
      • Messages can be read or created, but they cannot be modified
    • A message is only removed based on age; otherwise, it remains available to be read at any point in the future, given enough storage space

  • Create topics
    • There are multiple ways to create topics
    • We can use bin/kafka-topics.sh script to create topics
bin/kafka-topics.sh \
  --bootstrap-server localhost:9092 \
  --topic <topic_name> \
  --create \
  --replication-factor <x> \
  --partitions <x>
  • --replication-factor – Specifies how many copies of the topic exist within a Kafka cluster
  • --partitions – Specifies the number of partitions to split the topic into; this is a performance optimization that allows Kafka to handle higher throughput

  • Describe topics
    • --describe – Gets details about the configuration of a Kafka topic
bin/kafka-topics.sh --bootstrap-server <server> --topic <topic_name> --describe

  • Removing topics
    • --delete
      • Removes the topic from a Kafka server/cluster
      • It also removes all messages in the topic
bin/kafka-topics.sh --bootstrap-server <server> --topic <topic_name> --delete