Hello everyone. Welcome back to Airflow 101. In the last episode we discussed what Apache Airflow is, why we use it, and gave a brief overview of its architecture. In this episode we will talk about DAGs and tasks, which are the building blocks of any workflow. I will also show you how to define a DAG file and write your own DAG. Let’s start with directed acyclic graphs, or DAGs. In order to learn about DAGs, we need to know what directed graphs are. In order to know about directed graphs, we need to know what a graph is. You see what I did there? I made a workflow. So let’s start with graphs. In mathematics, a graph is something that has nodes and edges. A directed graph, as the name suggests, has directed edges: we can see that there is a directed edge from this node to this node.
Similarly, we have more directed edges. Now, a directed acyclic graph, as its self-explanatory name suggests, is a directed graph that has no cycles. For example, in this directed graph, if I follow this path, I go here, then here, and you can see I am stuck in a cyclic dependency. We don’t want cyclic dependencies in our workflows, so we use directed acyclic graphs. Let’s follow this path: if I start from here, now I’m here, and this is the end, right? I can start from here and I’ll reach here.
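To make the "no cycles" idea concrete, here is a small illustration in plain Python (not Airflow code, just a sketch) that checks whether a directed graph contains a cycle — the property Airflow enforces on your task dependencies:

```python
# Illustrative cycle check for a directed graph, using depth-first search.
# The graph is a dict mapping each node to a list of its successors.

def has_cycle(graph):
    """Return True if the directed graph contains a cycle."""
    WHITE, GRAY, BLACK = 0, 1, 2          # unvisited / in progress / done
    color = {node: WHITE for node in graph}

    def visit(node):
        color[node] = GRAY
        for succ in graph.get(node, []):
            if color.get(succ, WHITE) == GRAY:   # back edge: we're in a cycle
                return True
            if color.get(succ, WHITE) == WHITE and visit(succ):
                return True
        color[node] = BLACK
        return False

    return any(color[n] == WHITE and visit(n) for n in graph)

# A directed graph with a cycle: a -> b -> c -> a
cyclic = {"a": ["b"], "b": ["c"], "c": ["a"]}
# A DAG: a -> b, a -> c, b -> d, c -> d
acyclic = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}

print(has_cycle(cyclic))   # True
print(has_cycle(acyclic))  # False
```

If you start walking the first graph you never reach an end, which is exactly why a workflow must be acyclic: the scheduler needs every path to terminate.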
There are no cyclic dependencies, and this is what we want in our workflows. Now let’s talk about DAGs with respect to Airflow. In Airflow, a DAG is a collection of tasks with defined dependencies and properties, and we define them using the Python programming language. This is how a DAG looks in the Airflow web server: we have two tasks, dummy_start and dummy_end, and this is the dependency, so dummy_end depends upon dummy_start. Another thing I would like to talk about is the DAG run. A DAG run is a metadata entry in the database that records each run of a DAG. A DAG run can either be created by the scheduler, or you can manually trigger a DAG to create one. A DAG can have multiple DAG runs at any given point in time. Next, tasks: each node in a DAG represents a task, and tasks are the unit of work in Airflow.
Each task can be defined using an operator, a sensor, or a hook; at a high level, all of those are classified as operators. We will talk more about what operators, sensors, and hooks are in the upcoming videos. Similarly, I would like to talk about task instances. A task instance is a runnable entity of a task: it is a run of a task at a point in time. Given a DAG run and a task, we can identify a task instance. Task instances belong to DAG runs, just as tasks belong to a DAG. Let’s take an example here. This is my Airflow web server, and this is the example_branch_operator DAG. In the graph view you can see that we have multiple tasks and all their dependencies. Here we can see that it has a BranchPythonOperator and some DummyOperators. You don’t need to go into the details of what these operators do; we’ll talk about all of them in the upcoming videos. Okay, let’s just turn on this DAG. Right now we have one DAG run, and these are its task instances, right? If I trigger this DAG again, it has two DAG runs. If I trigger it again, it now has three DAG runs. It tells me that this DAG has three running instances, and we have multiple task instances. If we want to see all these task instances, we can switch over to the tree view, and it will give us the list of all the task instances corresponding to all the DAG runs. We have these three DAG runs, and these are the task instances corresponding to them.
We can see that this one is already completed and the rest are still running. If I just trigger the DAG... let me turn it on first. So I have one DAG run. If I trigger it again, I’ll have two DAG runs, right? So I hope that the concept of DAG runs and task instances is clear to you all. Now let’s talk about how to define a DAG. I have broken down the entire process of writing a DAG file into five smaller steps. The first one is importing modules. The second is defining default arguments. Third, creating a DAG object. Fourth, defining tasks. And fifth, defining dependencies. Bring up your favorite text editor and let’s get started. I will be using Sublime Text. Let’s start with step one: importing modules. The first thing we need to import is the DAG class: from airflow import DAG. Next, we need to import some modules related to date and time: from datetime, import datetime and something called timedelta. Next, we will be importing an operator, because we need to define tasks, and in order to define tasks we use operators. For now, I will be using the DummyOperator, which, as the name suggests, does nothing. In the upcoming videos we will use more advanced operators and write more complicated workflows. So, from airflow.operators, we import the DummyOperator.
Step two is default_args. default_args is a dictionary that we pass to the DAG object, and it contains some metadata. One key is owner. If you want to know more about all the keys in default_args, I strongly recommend that you check Airflow’s official documentation. Then there is depends_on_past; let’s keep it False. If I’m not explaining a key, we’ll talk about it in the upcoming videos, because right now I just want you all to know how to define your DAG file; I’ll make a separate video for all the default_args keys. Then start_date, which is when we want this DAG to start; let’s set it to a datetime. Step three is creating the DAG object. We need to pass in a dag_id, which is the unique identifier for the DAG; I’ll keep it as dag_one.
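As a sketch, a default_args dictionary along these lines is what we are building — note that the specific values here (the owner name, the start date, the retry settings) are illustrative placeholders, not the exact ones typed in the video:

```python
from datetime import datetime, timedelta

# Illustrative default_args dictionary; these keys are passed to every
# task in the DAG unless a task overrides them.
default_args = {
    "owner": "airflow",                    # who owns this DAG
    "depends_on_past": False,              # don't wait on the previous run
    "start_date": datetime(2021, 1, 1),    # when the DAG becomes active
    "retries": 1,                          # retry a failed task once
    "retry_delay": timedelta(minutes=5),   # wait 5 minutes between retries
}
```

The full list of supported keys is in Airflow's official documentation; these are just the common ones.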
Next is default_args, to pass in the default arguments dictionary. catchup is False; we’ll talk more about catchup and backfill in episode eight, so don’t get confused right now. And the most important one, schedule_interval: it tells the scheduler when to schedule the DAG. Let’s keep it "@once", so this DAG will only run once. Step four is creating the tasks. I’ll be using the DummyOperator. Again, we need to pass a task_id, which is a unique identifier; let’s name it start. We also need to pass in the DAG object. Let me just copy it, and let’s name the second task end. Step five is creating the dependency. You can create it just like this: the end task depends upon the start task. Now, let’s save it. In order for Airflow to pick up this DAG file, we need to place it in the dags folder. If you have watched my video where I showed how to install Airflow, I mentioned there that Airflow creates a directory named airflow in your home directory.
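Putting the five steps together, the finished file might look something like this. This is a minimal sketch assuming the Airflow 1.x-style import path for DummyOperator used in this series; the dag_id and dates are illustrative, and running it requires an Airflow installation:

```python
# dag_one.py -- a minimal DAG file: imports, default_args, DAG object,
# tasks, and the dependency between them.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

# Step 2: default arguments applied to every task in the DAG.
default_args = {
    "owner": "airflow",
    "depends_on_past": False,
    "start_date": datetime(2021, 1, 1),   # illustrative start date
}

# Step 3: the DAG object.
dag = DAG(
    dag_id="dag_one",            # unique identifier for this DAG
    default_args=default_args,
    catchup=False,               # covered in a later episode
    schedule_interval="@once",   # run this DAG only once
)

# Step 4: two do-nothing tasks.
start = DummyOperator(task_id="start", dag=dag)
end = DummyOperator(task_id="end", dag=dag)

# Step 5: the dependency -- end runs after start.
start >> end
```

Saved into the dags folder, this produces the two-task DAG you see in the web server, with end depending on start.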
And there we created a dags folder. I’m going to save the file there; let me give it a .py extension. Okay, let’s move on to the web server. It will take some time to pick it up, because Airflow periodically scans the dags folder; there are configuration options we can change to make the scan more frequent. So here is the DAG, and this is how it looks: it has two tasks, start and end. Let’s turn it on and refresh. This is our DAG, and that’s how you can create your own DAG file. Just try to play around with DAG files and create more complicated DAGs. I’ll see you all in the next video. Thank you all for watching.