
Getting Started with Apache Airflow

1. Introduction

This article is an in-depth introduction to Apache Airflow.

Airflow was created at Airbnb and open-sourced in 2015. It has around 1,000 contributors in the Apache community. Apache Airflow is used for automating tasks such as data collection, preprocessing, uploading, and reporting to improve efficiency.

2. Apache Airflow

Apache Airflow is a workflow automation and scheduling platform for building and running data pipelines. Pipelines can be executed in a specified order using the appropriate resources, and the web-based user interface makes it easy to track and manage workflows.

2.1 Prerequisites

Python 3.8.8 is required; it can be installed on Windows or any other operating system. PyCharm is used for the Python programming in this tutorial.

2.2 Download

Python 3.8.8 can be downloaded from python.org. PyCharm is available from the JetBrains website.

2.3 Setup

2.3.1 Python Setup

To install Python, run the downloaded package or executable and follow the installer prompts.
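
Once the installer finishes, you can confirm that the interpreter is on your PATH. Depending on the platform, the command may be python or python3:

Python version check

python3 --version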

2.3.2 Airflow Setup

You can install Apache Airflow by using the command below:

Airflow installation

pip3 install apache-airflow
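
Note that the Airflow project recommends installing with a constraints file so that the pinned dependency versions match your Python version. A sketch, assuming Airflow 2.2.3 on Python 3.8 (adjust both versions in the URL to your environment):

Airflow installation with constraints

pip3 install "apache-airflow==2.2.3" --constraint "https://raw.githubusercontent.com/apache/airflow/constraints-2.2.3/constraints-3.8.txt"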

2.4 Features

Apache Airflow is easy to use. It is open source and maintained by the Apache community. It offers integrations with the major cloud platforms. Workflows are written in Python, and the user interface makes it easy to track the status of workflow tasks.

2.5 Components

Apache Airflow has four main components:

  • DAG: a directed acyclic graph, written in Python, that defines the tasks and the relationships between them.
  • Webserver: a Flask-based web application that serves the user interface.
  • Metadata database: stores the state of tasks and DAG runs.
  • Scheduler: monitors DAGs, triggers scheduled tasks, and updates task status in the metadata database.

Data pipelines are dynamic and configured in Python code. They are extensible: you can use different operators and executors, and pipelines can be parameterized using Python scripts. Airflow has a modular, scalable architecture and uses a message queue to communicate with an arbitrary number of workers.
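
As a quick illustration of parameterization, the hypothetical DAG below (the DAG id and parameter names are invented for this example) passes a params dictionary that is rendered into a BashOperator command through Jinja templating:

Parameterized DAG sketch

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# A hypothetical DAG illustrating parameterization; the names are examples only.
with DAG('greeting_example',
         schedule_interval=None,  # run only when triggered manually
         start_date=datetime(2022, 1, 6),
         catchup=False,
         params={'greeting': 'Hello from a parameterized DAG'}) as dag:

    # The Jinja template {{ params.greeting }} is rendered at task run time.
    say_greeting = BashOperator(
        task_id='say_greeting',
        bash_command='echo "{{ params.greeting }}"')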

2.6 Apache Airflow HelloWorld

You need to initialize the Airflow metadata database using the commands below. The first two commands activate the environment and point AIRFLOW_HOME at the current directory:

Airflow Database Setup

source activate airflow-tutorial
export AIRFLOW_HOME="$(pwd)"
airflow db init
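
By default, db init creates a SQLite database file named airflow.db under AIRFLOW_HOME. You can verify the installation and the database connection using the commands below:

Airflow installation check

airflow version
airflow db check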

Next, you can create the admin user using the command below:

Airflow Admin user creation

airflow users create --role Admin --username admin --email admin --firstname admin --lastname admin --password admin

The output of the command will be similar to the following:

Airflow Admin user creation Output

(airflow-venv) apples-MacBook-Air:apacheairflow bhagvan.kommadi$ airflow users  create --role Admin --username admin --email admin --firstname admin --lastname admin --password admin
[2022-01-03 01:54:50,763] {manager.py:763} WARNING - No user yet created, use flask fab command to do it.
[2022-01-03 01:54:51,295] {manager.py:512} WARNING - Refused to delete permission view, assoc with role exists DAG Runs.can_create Admin
[2022-01-03 01:54:53,936] {manager.py:214} INFO - Added user admin
User "admin" created with role "Admin"

You can now write the Hello World DAG using the code below.

Airflow HelloWorld

from datetime import datetime

from airflow import DAG
# In Airflow 2.x the non-deprecated import paths are airflow.operators.dummy
# and airflow.operators.python (the old *_operator modules emit warnings).
from airflow.operators.dummy import DummyOperator
from airflow.operators.python import PythonOperator


def print_hello_world():
    return 'Hello world!'


# Run daily at 12:00; catchup=False skips runs between start_date and now.
dag = DAG('hello_world', description='Hello World DAG',
          schedule_interval='0 12 * * *',
          start_date=datetime(2022, 1, 6), catchup=False)

dummy_operator = DummyOperator(task_id='dummy_task', retries=3, dag=dag)

hello_world_operator = PythonOperator(task_id='hello_world_task',
                                      python_callable=print_hello_world,
                                      dag=dag)

# The >> operator makes hello_world_task run after dummy_task.
dummy_operator >> hello_world_operator
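
Before scheduling the DAG, you can run a single task in isolation with the airflow tasks test command, which executes the task without recording any state in the metadata database. The last argument is the logical execution date:

Airflow task test

airflow tasks test hello_world hello_world_task 2022-01-06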

After placing the Python file in the dags folder under AIRFLOW_HOME, you can start the webserver using the command below (the -p flag overrides the default port, 8080):

Airflow WebServer

airflow webserver -p 8081

You also need to start the Airflow scheduler, in a separate terminal, using the command below:

Airflow Scheduler

airflow scheduler
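
Once the scheduler has parsed the dags folder, you can confirm that the DAG was picked up and trigger a run from the command line:

Airflow DAG listing and trigger

airflow dags list
airflow dags trigger hello_world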

After logging in with the admin username and password on the web application (http://localhost:8081), you can see the DAG created above in the list of DAGs.

HelloWorld DAG

You can click on the hello_world DAG and execute the tasks. The output will be as shown in the graph view below.

DAG Graph View

You can see the Hello World task log after clicking on the status popup. The log is shown below:

DAG Task Log View

2.7 Use Cases

Apache Airflow is used for the following tasks:

  • Sequencing
  • Coordination
  • Scheduling
  • Complex Data operations

Airflow workflows are used in applications in the areas mentioned below:

  • Business Intelligence Applications
  • Data Science Applications
  • Machine Learning Models
  • Big data Apps

Apache Airflow is extensible and supports many different use cases. Workflows are declared through task dependencies, and each step can be built as a separate unit. Because pipelines are written in Python, they can be versioned and managed like any other code, which helps with change management.
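
As a minimal sketch of this dependency-based, Python-native style, the example below uses the TaskFlow API introduced in Airflow 2.0; the DAG and task names are invented for illustration:

TaskFlow API sketch

from datetime import datetime

from airflow.decorators import dag, task


# A hypothetical DAG written with the TaskFlow API; names are examples only.
@dag(schedule_interval=None, start_date=datetime(2022, 1, 6), catchup=False)
def hello_taskflow():

    @task
    def extract():
        # The return value is passed to downstream tasks via XCom.
        return {'value': 42}

    @task
    def report(payload):
        print(f"value is {payload['value']}")

    # Calling the tasks wires up the dependency: extract -> report.
    report(extract())


# Instantiating the decorated function registers the DAG with Airflow.
hello_taskflow_dag = hello_taskflow()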

3. Download the Source Code

You can download the full source code of this example here: Getting Started with Apache Airflow

Bhagvan Kommadi

Bhagvan Kommadi is the Founder of Architect Corner and has around 20 years' experience in the industry, ranging from large-scale enterprise development to helping incubate software product start-ups. He holds a Masters in Industrial Systems Engineering from Georgia Institute of Technology (1997) and a Bachelors in Aerospace Engineering from the Indian Institute of Technology, Madras (1993). He is a member of the IFX forum and Oracle JCP, and a participant in the Java Community Process. He founded Quantica Computacao, the first quantum computing startup in India; Markets and Markets has positioned Quantica Computacao in the 'Emerging Companies' section of its Quantum Computing quadrants. Bhagvan has engineered and developed simulators and tools in the area of quantum technology using IBM Q, Microsoft Q#, and Google QScript. He reviewed the Manning book "Machine Learning with TensorFlow" and is the author of the Packt Publishing book "Hands-On Data Structures and Algorithms with Go". He is a member of the MIT Technology Review Global Panel.