Docker Kafka Topic Creation
Apache Kafka is an open-source stream-processing software platform developed by the Apache Software Foundation. Kafka uses topics, partitions, and replication to manage data organization, parallel processing, and fault tolerance. It is designed to handle high-throughput, fault-tolerant, and scalable real-time data streaming. Kafka is often used for building real-time data pipelines and streaming applications. It enables the seamless integration of various data sources and applications, allowing them to communicate and process data in real time.
1. Introduction
Kafka is an open-source distributed streaming platform developed by LinkedIn and later donated to the Apache Software Foundation. It was designed to handle real-time data streams, making it a highly scalable, fault-tolerant, and distributed system for processing and storing large volumes of event data. Kafka is widely used for various use cases, such as log aggregation, event sourcing, messaging, and real-time analytics.
1.1 Key Concepts
- Topics: Kafka organizes data streams into topics, which are similar to categories or feeds. Each topic consists of a stream of records or messages.
- Producers: Producers are applications that publish data to Kafka topics. They write messages to specific topics, and these messages are then stored in the Kafka brokers.
- Brokers: Kafka brokers are the nodes that form the Kafka cluster. They are responsible for receiving, storing, and serving messages. Each broker holds one or more partitions of a topic.
- Partitions: Topics can be divided into multiple partitions, which are essentially ordered logs of messages. Partitions allow data to be distributed and processed in parallel across different brokers.
- Consumers: Consumers are applications that read data from Kafka topics. They subscribe to one or more topics and receive messages from the partitions of those topics.
- Consumer Groups: Consumers can be organized into consumer groups, where each group consists of one or more consumers. Each message in a partition is delivered to only one consumer within a group, allowing parallel processing of data.
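To make consumer groups concrete, here is a minimal sketch using the console consumer that ships with the Kafka distribution. It assumes the Kafka CLI tools are on your PATH (inside a Kafka container they are), a broker is reachable at localhost:9092, and a topic named my-topic exists; the group name my-group is arbitrary. With two consumers in the same group, each partition's messages are delivered to only one of them:

Consumer group sketch
# Terminal 1: first member of consumer group "my-group"
kafka-console-consumer.sh --topic my-topic --group my-group --bootstrap-server localhost:9092

# Terminal 2: second member of the same group; the partitions of
# my-topic are split between the two running consumers
kafka-console-consumer.sh --topic my-topic --group my-group --bootstrap-server localhost:9092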
1.2 How does Kafka work?
- Data Ingestion: Producers send messages to Kafka brokers. Producers can choose to send messages synchronously or asynchronously.
- Storage: Messages are stored in partitions within Kafka brokers. Each partition is an ordered, immutable sequence of messages.
- Replication: Kafka provides fault tolerance through data replication. Each partition has one leader and multiple replicas. The leader handles read and write operations, while the replicas act as backups. If a broker fails, one of its replicas can be promoted as the new leader.
- Retention: Kafka allows you to configure a retention period for each topic, determining how long messages are retained in the system. Older messages are eventually purged, making Kafka suitable for both real-time and historical data processing (see the retention example after this list).
- Consumption: Consumers subscribe to one or more topics and read messages from partitions. Consumers can process data in real-time or store it in a database for later analysis.
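Retention, for instance, can be adjusted per topic with the kafka-configs.sh tool that ships with the Kafka distribution. A minimal sketch, assuming the Kafka CLI tools are on your PATH, a broker is reachable at localhost:9092, and a topic named my-topic exists:

Retention configuration
# Keep messages on my-topic for 24 hours (retention.ms is in milliseconds)
kafka-configs.sh --bootstrap-server localhost:9092 --alter \
  --entity-type topics --entity-name my-topic \
  --add-config retention.ms=86400000

# Verify the configured value
kafka-configs.sh --bootstrap-server localhost:9092 --describe \
  --entity-type topics --entity-name my-topic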
Overall, Kafka has become an essential component in modern data architectures, allowing organizations to handle large-scale event-driven data streams efficiently.
2. What is Docker?
Docker is an open-source platform that enables containerization, allowing you to package applications and their dependencies into standardized units called containers. These containers are lightweight, isolated, and portable, providing consistent environments for running applications across different systems. Docker simplifies software deployment by eliminating compatibility issues and dependency conflicts. It promotes scalability, efficient resource utilization, and faster application development and deployment. With Docker, you can easily build, share, and deploy applications in a consistent and reproducible manner, making it a popular choice for modern software development and deployment workflows. Some benefits of Docker are:
- Portability: Docker containers can run on any platform, regardless of the underlying infrastructure. This makes it easy to move applications between development, testing, and production environments.
- Scalability: Docker allows you to quickly and easily scale your application by adding or removing containers as needed, without having to make changes to the underlying infrastructure.
- Isolation: Docker provides a high level of isolation between applications, ensuring that each container runs independently of others, without interfering with each other.
- Efficiency: Docker containers are lightweight and efficient, consuming fewer resources than traditional virtual machines. This allows you to run more applications on the same hardware.
- Consistency: Docker ensures that applications behave the same way across different environments, making it easier to test and deploy new versions of your application.
- Security: Docker provides built-in security features that help protect your applications from external threats. Docker containers are isolated from each other and the underlying infrastructure, reducing the risk of attacks.
Overall, Docker provides a powerful platform for building, testing, and deploying applications that are both efficient and reliable.
2.1 What is Docker used for?
It is commonly used for:
- Replicating the production environment while the code runs locally on the machine.
- Multiple deployment phases, i.e. Dev/Test/QA.
- Version control and distributing the application's OS within a team.
3. Setting up Apache Kafka on Docker
Using Docker Compose simplifies the process by defining the services, their dependencies, and network configuration in a single file. It allows for easier management and scalability of the environment. Make sure you have Docker and Docker Compose installed on your system before proceeding with these steps. To set up Apache Kafka on Docker using Docker Compose, follow these steps.
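Both installations can be verified from a terminal:

Verify installations
docker --version
docker-compose --version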
3.1 Creating Docker Compose file
Create a file called docker-compose.yml and open it for editing.
- Zookeeper Service: The zookeeper service is defined to run Zookeeper, which is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services. The configuration for this service is as follows:
  - Image: The service uses the wurstmeister/zookeeper image.
  - Container Name: The name of the container is set to zookeeper.
  - Ports: Zookeeper uses port 2181 for client connections, which is mapped from the host to the container.
- Kafka Service: The kafka service is defined to run Apache Kafka, a distributed streaming platform. Here's how the Kafka service is configured:
  - Image: The service uses the wurstmeister/kafka image.
  - Container Name: The name of the container is set to kafka.
  - Ports: The INSIDE listener on port 9092 serves clients on the Docker network, while the OUTSIDE listener on port 9093 is published to the host so that local clients can connect via localhost:9093.
  - Environment Variables: Several environment variables are set to configure Kafka. Notably, KAFKA_ZOOKEEPER_CONNECT specifies the Zookeeper connection string, KAFKA_ADVERTISED_LISTENERS defines the listeners advertised to clients, and KAFKA_CREATE_TOPICS creates a topic named my-topic with 1 partition and a replication factor of 1.
  - Depends On: The Kafka service depends on the zookeeper service, ensuring Zookeeper is running before Kafka starts.
Add the following content to the file and save it once done.
docker-compose.yml
version: '3'
services:
  zookeeper:
    image: wurstmeister/zookeeper
    container_name: zookeeper
    ports:
      - "2181:2181"
  kafka:
    image: wurstmeister/kafka
    container_name: kafka
    ports:
      - "9092:9092"
      # publish the OUTSIDE listener so host clients can reach localhost:9093
      - "9093:9093"
    environment:
      KAFKA_ADVERTISED_LISTENERS: INSIDE://kafka:9092,OUTSIDE://localhost:9093
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: INSIDE:PLAINTEXT,OUTSIDE:PLAINTEXT
      KAFKA_LISTENERS: INSIDE://0.0.0.0:9092,OUTSIDE://0.0.0.0:9093
      KAFKA_INTER_BROKER_LISTENER_NAME: INSIDE
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      KAFKA_CREATE_TOPICS: "my-topic:1:1"
    depends_on:
      - zookeeper
3.2 Running the Kafka containers
Open a terminal or command prompt in the same directory as the docker-compose.yml file. Start the Kafka containers by running the following command:
Start containers
docker-compose up -d
This command will start the ZooKeeper and Kafka containers in detached mode, running in the background.
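To confirm the setup, you can check the container status and list the topics on the broker; the my-topic topic declared via KAFKA_CREATE_TOPICS should appear once the broker has finished starting:

Verify the cluster
# Show the state of the zookeeper and kafka containers
docker-compose ps

# List topics on the broker running inside the kafka container
docker-compose exec kafka kafka-topics.sh --list --bootstrap-server kafka:9092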
To stop and remove the containers, as well as the network created, use the following command:
Stop containers
docker-compose down
3.3 Creating a Topic
After the Kafka cluster is operational, create the Kafka topic. Go to the directory containing the docker-compose.yml file and execute the following command. It creates the jcg-topic topic using the Kafka broker listening on port 9092.
Kafka Topic
docker-compose exec kafka kafka-topics.sh --create --topic jcg-topic --partitions 1 --replication-factor 1 --bootstrap-server kafka:9092
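Optionally, describe the topic to confirm its partition count, replication factor, and leader assignment:

Describe Topic
docker-compose exec kafka kafka-topics.sh --describe --topic jcg-topic --bootstrap-server kafka:9092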
3.4 Publishing and Consuming Messages
With the Kafka topic set up, let's publish and consume messages. Start by initiating the Kafka consumer:
Kafka Consumer
docker-compose exec kafka kafka-console-consumer.sh --topic jcg-topic --from-beginning --bootstrap-server kafka:9092
The command allows the Kafka consumer to process messages from the jcg-topic topic. Using the --from-beginning flag ensures the consumption of all messages from the topic's start.
In a separate terminal, initiate the Kafka producer using this command to generate and dispatch messages to the jcg-topic topic.
Kafka Producer
docker-compose exec kafka kafka-console-producer.sh --topic jcg-topic --broker-list kafka:9092
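Type a few lines into the producer prompt; each line is sent as a message and should appear in the consumer terminal. Messages can also be piped in non-interactively; the -T flag tells docker-compose exec not to allocate a pseudo-TTY, so stdin is forwarded to the producer:

Produce from stdin
echo "hello from docker" | docker-compose exec -T kafka kafka-console-producer.sh --topic jcg-topic --broker-list kafka:9092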
4. Conclusion
In conclusion, mastering the art of creating Kafka topics using Docker Compose is a fundamental skill for modern data engineers and developers. The provided docker-compose.yml demonstrates a streamlined approach to setting up a Kafka cluster with Zookeeper, showcasing the simplicity and power of containerized deployments. By encapsulating Kafka services within Docker containers, developers can ensure consistency across different environments, making development, testing, and deployment more predictable and reliable. This method not only simplifies the setup process but also allows for easy scaling and replication of Kafka topics in real-world applications.