DAGs

What is a DAG in Airflow?

How to Write Your First DAG in Apache Airflow – Airflow Tutorials

In this episode, we will learn what DAGs and tasks are and how to write a DAG file for Airflow. This episode also covers some key points regarding DAG runs and task instances.

00:08

Airflow DAG Tutorial

Hello everyone. Welcome back to Airflow 101. In the previous episodes, we discussed what Apache Airflow is, why we use it, and gave a brief overview of the architecture. In this episode, we will talk about DAGs and tasks, which are the building blocks of any workflow. I will also show you how to define a DAG file and write your own DAG. Let’s start with directed acyclic graphs, or DAGs.

Airflow DAGs

In order to learn about DAGs, we need to know what directed graphs are. In order to know about directed graphs, we need to know what a graph is. Do you see what I did there? I made a workflow. So let’s start with the graph. In mathematics, a graph is a structure made of nodes and edges. A directed graph, as the name suggests, has directed edges: we can see that there is a directed edge from one node to another.

01:05

Airflow Workflows

Similarly, we have directed edges here. Now, a cyclic graph, as the name suggests, is a graph that contains a cycle. For example, in this directed graph, if I follow this path, I end up back where I started: I’m stuck in a cyclic dependency. We don’t want cyclic dependencies in our workflows, so we use directed acyclic graphs. Let’s follow this path: if I start from here, I reach here, and this is the end. Whichever node I start from, I always reach an end.

So there are no cyclic dependencies, and this is what we want in our workflows. Let’s talk about DAGs with respect to Airflow. In Airflow, a DAG is a collection of tasks with defined dependencies and properties, and we define them using the Python programming language. This is how a DAG looks in the Airflow web server: we have two tasks, dummy_start and dummy_end.

02:07

DAGRun

And this is the dependency: dummy_end depends on dummy_start. Another thing I would like to talk about is the DAG run. It is a metadata entry in the database that tells us how many times a DAG has run. A DAG run can either be created by the scheduler, or you can manually trigger a DAG to create one. A DAG can have multiple DAG runs at any given point in time.
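
To make that concrete: the two-task DAG in the screenshot boils down to a short Python file. Here is a minimal sketch; the task names are read off the UI, while the DAG id, start date, and schedule are assumptions. We will build a file like this step by step later in this episode.

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.dummy_operator import DummyOperator

    # A DAG object with an assumed id, start date, and schedule.
    dag = DAG('example_dag', start_date=datetime(2020, 1, 1), schedule_interval='@once')

    # The two tasks from the screenshot; DummyOperator does nothing.
    dummy_start = DummyOperator(task_id='dummy_start', dag=dag)
    dummy_end = DummyOperator(task_id='dummy_end', dag=dag)

    # The dependency: dummy_end depends on dummy_start.
    dummy_start >> dummy_end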

Airflow Tasks

Let’s talk about tasks. A node in the DAG represents a task, and tasks are the units of work in Airflow. Each task can be defined using an operator, a sensor, or a hook; at a high level, all of these are classified as operators. We will talk more about what operators, sensors, and hooks are in the upcoming videos. Similarly, I would like to talk about task instances. A task instance is a runnable entity of a task: it is a run of a task at a point in time.

03:09

Task Instances

Given a DAG run, a task, and a point in time, we can define a task instance. Task instances belong to DAG runs and tasks, which in turn belong to a DAG. Let’s take an example here. This is my Airflow web server, and this is an example DAG.

This is the graph view, and you can see that we have multiple tasks and all the dependencies. 

03:38

Airflow Operator

Here we can see that it has some branch operators, Python operators, and dummy operators. You don’t need to go into the details of what these operators do; we’ll talk about all of these in the upcoming videos. Okay, let’s just turn on this DAG.

Airflow DAG Runs

Right now we have one DAG run, and these are its task instances, right? If I just trigger this DAG again, it has two DAG runs. If I trigger it again, it now has three DAG runs. It tells me that this DAG has three running DAG runs, and we have multiple task instances.

Airflow Task Instances

If we want to see all these task instances, we can just switch over to this view, and it will give us the list of all the task instances corresponding to all the DAG runs, right? We have these three DAG runs, and these are the task instances corresponding to them.

04:48

Airflow Trigger

We can see that this one has already been completed and the rest are running. If I turn this DAG on and trigger it, I have one DAG run. If I trigger it again, I’ll have two DAG runs, right? I hope that the concept of DAG runs and task instances is clear to you all.

05:18

Create a DAG

Now let’s talk about how to define a DAG. I have broken down the entire process of writing a DAG file into five smaller steps. 

  • The first one is importing modules. 
  • The second is defining default arguments. 
  • Third, creating a DAG object. 
  • Fourth, defining tasks. 
  • And fifth, defining our dependencies. 

Bring up your favorite text editor and let’s get started; I will be using Sublime Text. So let’s start with step one: importing modules. The first thing that we need to import is the DAG class: from airflow import DAG. Next, we need some modules related to date and time: from datetime import datetime and timedelta. Then we need to import an operator, because we use operators to define tasks. For now, I will be using the DummyOperator, which, as the name suggests, does nothing. In the upcoming videos, we will be using more advanced operators and writing more complicated workflows.
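
Put together, step one might look like the sketch below. The DummyOperator import path follows the dictation here and matches Airflow 1.x releases; newer Airflow versions have since moved it.

    # Step 1: importing modules, as dictated above.
    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.operators.dummy_operator import DummyOperator  # Airflow 1.x path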

06:33

So that is: from airflow.operators.dummy_operator import DummyOperator. Right. Step two: default_args is a dictionary that we pass to the DAG object, and it contains some metadata. One key is the owner. If you want to know more about all the keys in default_args, I strongly recommend that you check Airflow’s official documentation. The next key is depends_on_past; let’s keep it False. If I’m not explaining a key, we’ll talk about it in the upcoming videos.

Define DAG

Because right now, I just want you all to know how you can define your DAG file. We can talk about all these keys in default_args in the upcoming videos; I’ll make a separate video for that. The last key is start_date: it is when we want this DAG to start, and we give it a datetime value.
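
Put together, step two might look like this sketch. The owner name and the exact start date are assumptions, since the narration does not pin them down.

    # Step 2: default arguments -- a dictionary of metadata passed to the DAG object.
    default_args = {
        'owner': 'airflow',                  # assumed placeholder owner name
        'depends_on_past': False,            # a run does not depend on the previous run
        'start_date': datetime(2020, 1, 1),  # assumed date; any datetime works
    }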

08:13

default_args

And step three, creating a DAG object. We need to pass in a dag_id, which is a unique identifier for the DAG; I’ll keep it as dag_1. Next is default_args: we need to pass in the default_args dictionary. Then catchup=False; we’ll talk more about catchup and backfilling in episode eight, so don’t get confused right now. And the most important one is schedule_interval: it tells the scheduler when to schedule the DAG. Let’s keep it '@once', so this DAG will only run once.
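
Step three, sketched out with the arguments just described:

    # Step 3: creating the DAG object.
    dag = DAG(
        dag_id='dag_1',             # unique identifier for this DAG
        default_args=default_args,  # the metadata dictionary from step two
        catchup=False,              # no backfilling of past runs (episode eight)
        schedule_interval='@once',  # run this DAG exactly once
    )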

Step four is creating a task; I’ll be using a DummyOperator. Again, we need to pass a task_id, which is a unique identifier; let’s name it start. We also need to pass in the DAG object, dag. Let me just copy it and name the second task end. Step five will be creating the dependency. You can create it just like this, right? The end task depends upon the start task. Now, let’s save it. In order for Airflow to pick up this DAG file, we need to place it in the dags folder, right?
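
Steps four and five might look like this; the >> bitshift notation is Airflow’s shorthand for setting a downstream dependency.

    # Step 4: defining the tasks -- two DummyOperators that do nothing.
    start = DummyOperator(task_id='start', dag=dag)
    end = DummyOperator(task_id='end', dag=dag)

    # Step 5: defining the dependency -- end runs only after start.
    start >> end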

09:26

Airflow Directory

If you have watched my video where I showed how to install Airflow, I mentioned there that Airflow creates a directory named airflow in your home directory, and there we created a dags folder. I’m going to save this file over there. Let me name it dag_1.py. Okay, let’s move on to the web server. It’ll take some time to pick it up. So we have our DAG file here.

So Airflow periodically checks the dags folder, and there are certain configurations that we can change to make that check more frequent, right? So here is the DAG. This is how it looks: it has two tasks, start and end. Let’s turn it on and refresh. Right? So this is our DAG. That’s how you can create your own DAG file. Just try to play around with DAG files and try to create more complicated DAGs.
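
For reference, that scan frequency lives in airflow.cfg; a sketch of the relevant setting (the value shown is just an example, in seconds):

    [scheduler]
    # How often the scheduler scans the dags folder for new files, in seconds.
    dag_dir_list_interval = 30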

I’ll see you all in the next video. Thank you all for watching.