Hey everyone. Welcome back to Airflow 101. This is episode eight. In the last few videos we talked about DAG states and how those states change. In this video I would like to talk about a very important concept, which is backfill and catchup. Along with that, we will also see how you can use cron expressions in your DAGs. We’ll see how execution dates are determined in Airflow and how the scheduler picks up DAGs based on their execution date. So let’s get started. Let’s start with when a DAG is triggered. Or more precisely, when does the scheduler decide that a DAG needs to be executed? When we write a DAG file, we specify the start date in the default args. In this case, the start date is 13 April. Logically, you would expect the DAG to be triggered, or picked up by the scheduler, on 13 April at midnight.
This is not how Airflow works. The time at which the DAG run is created is equal to the start date plus the schedule interval. In the default args we set the start date, and when we are creating the DAG object we provide another key, which is the schedule interval. In this case it is daily. This is a cron expression or a cron preset. We will talk about it in the upcoming slides. For now, all I need to say is that it means: schedule this DAG every day. We want a DAG run to be created once every day. Now, the DAG run will be created on 14 April at midnight, because our start date is the 13th and our schedule interval is daily. 13 plus one is 14, so the DAG will be triggered on 14 April at midnight. I know this sounds very confusing, and it is very confusing, but this is how Apache Airflow works.
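To make the arithmetic concrete, here is a minimal Python sketch of the rule just described: the first DAG run is created at the start date plus the schedule interval. It uses only the standard library, not Airflow itself. The function name `first_trigger_time` and the year 2020 are assumptions for illustration, since the video only gives the day and the month.

```python
from datetime import datetime, timedelta


def first_trigger_time(start_date: datetime, interval: timedelta) -> datetime:
    """Earliest moment at which the first DAG run is created (simplified model)."""
    return start_date + interval


start_date = datetime(2020, 4, 13)     # 13 April at midnight; the year is assumed
schedule_interval = timedelta(days=1)  # stands in for the daily schedule

first_run = first_trigger_time(start_date, schedule_interval)
print(first_run)  # 2020-04-14 00:00:00
```

So a DAG with a start date of 13 April and a daily interval gets its first run on 14 April at midnight, not on the 13th.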
Now, let me explain this concept with the help of an example. Here I have my Airflow web server and we have two DAGs. DAG one has its start date as 10 July and its schedule interval as daily. You can see today’s date is 10 July, and DAG two has the same start date and schedule interval. Now let me turn on both of these DAGs, and you can see that no DAG run is created, because the DAG run should be created at midnight tonight. Let’s go ahead and change the start date to the 9th for DAG two. Let’s refresh the page, and we can see a DAG run is instantly created. I hope this example helped you understand the concept. Now let’s move on to the next topic. Another thing that I would like to talk about is the execution date and the started date.
If you open the tree view, you can see the run date, which is also known as the execution date, and the started date. In this example, we can see that both of these are the same. If I open the DAG run that was executed earlier, we can see that the execution date is 9 July and the started date is 10 July. This is because the run date is the start of the scheduling interval, which in this case was 9 July, and the started date is the time at which the DAG run was created. If you manually trigger a DAG, both the run date and the started date will be the same, and the run ID will be prefixed with "manual". This helps you to identify which DAG runs were manually triggered and which were created by the scheduler. When defining the schedule interval, you have the option to provide a cron expression.
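As a quick reference, the sketch below shows how the common cron presets correspond to ordinary cron expressions. The mapping follows the presets listed in the Airflow documentation. The `PRESETS` dictionary and the `to_cron` helper are hypothetical names used only for illustration and are not part of Airflow’s API.

```python
# Preset names accepted for the schedule interval, with their cron equivalents.
PRESETS = {
    "@hourly": "0 * * * *",   # at the top of every hour
    "@daily": "0 0 * * *",    # every day at midnight
    "@weekly": "0 0 * * 0",   # midnight on Sunday
    "@monthly": "0 0 1 * *",  # midnight on the first day of the month
    "@yearly": "0 0 1 1 *",   # midnight on 1 January
}


def to_cron(schedule: str) -> str:
    """Return the cron expression for a preset; pass plain cron strings through."""
    return PRESETS.get(schedule, schedule)


print(to_cron("@daily"))        # 0 0 * * *
print(to_cron("30 6 * * 1-5"))  # 30 6 * * 1-5
```

A preset is just shorthand, so anything a preset cannot express (for example, 06:30 on weekdays only) needs a full cron expression.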
If you have a very complicated scheduling interval, you can use a cron expression. In simpler cases, you can use the presets that are provided by default. To know more about these presets, you can head to the link I have provided. There you can see what these presets mean. I have added the meaning of each one. So, for example, hourly means that a DAG run will be created at an interval of one hour, and weekly means that a DAG run will be created once a week. If you have a fairly ordinary, not so complicated scheduling interval, you can use these presets. This will save you a lot of time and help make sure that you don’t have any bugs in your DAG file. Now let’s understand one of the most important concepts in Apache Airflow: backfill and catchup. Suppose you have a task that requires backfilling.
Apache Airflow gives you the functionality to backfill all of the DAG runs whose time dependencies have been satisfied, if you keep catchup enabled. If you set catchup to false, only the latest DAG run will be created. Now, let’s understand this with the help of an example. Here I have this snippet from my DAG file. Here you can see that the start date is 15 July and my schedule interval is hourly. If I turn on this DAG on 15 July at 04:10 a.m., instead of one DAG run, four DAG runs will be created. The four DAG runs will have execution times of midnight, 01:00 a.m., 02:00 a.m., and 03:00 a.m. The start time for all four DAG runs will be 04:10, because the execution time is the beginning of the scheduling interval and the start time is the time at which the DAG run was created. We have already discussed this in the previous slides.
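The counting in this example can be sketched in a few lines of plain Python. This is a simplified model of the behaviour described above, not Airflow’s actual scheduler code. The function name `due_execution_dates` is made up for illustration, and the year 2020 is an assumption.

```python
from datetime import datetime, timedelta


def due_execution_dates(start_date, interval, now, catchup=True):
    """Execution dates of all runs whose schedule interval has fully elapsed by `now`."""
    dates = []
    execution_date = start_date
    # A run for a given execution date is created only once its interval has ended.
    while execution_date + interval <= now:
        dates.append(execution_date)
        execution_date += interval
    # With catchup disabled, only the most recent completed interval gets a run.
    return dates if catchup else dates[-1:]


start = datetime(2020, 7, 15)               # 15 July at midnight; the year is assumed
turned_on = datetime(2020, 7, 15, 4, 10)    # DAG switched on at 04:10 a.m.
hourly = timedelta(hours=1)

with_catchup = due_execution_dates(start, hourly, turned_on, catchup=True)
without_catchup = due_execution_dates(start, hourly, turned_on, catchup=False)

print([d.strftime("%H:%M") for d in with_catchup])     # ['00:00', '01:00', '02:00', '03:00']
print([d.strftime("%H:%M") for d in without_catchup])  # ['03:00']
```

The 04:00 run is missing from both lists because its interval does not end until 05:00.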
Now let’s jump onto the web server and practice this concept. Here we are on our web server and we have two DAGs. Let’s look at DAG one, and we can see that the start date is 15 July, catchup is set to false, and the schedule interval is hourly. Let’s look at DAG two. The start date is again 15 July and the scheduling interval is hourly. The only difference here is that catchup is set to true. Now let’s turn on DAG one. As per the theory I taught in the previous slide, only one DAG run should be created. And here we have just one DAG run. Now, if I turn on DAG two: the start date was 15 July, the scheduling interval was hourly, and the current time is 01:27 p.m. So there should be 13 DAG runs created. We have two DAG runs so far, and more will be created.
So it’ll create around 13 DAG runs. Let’s have a look at the tree view. Here we are on the tree view, and you can see that the execution date, or the run ID, of these DAG runs is equal to the beginning of the schedule interval. You can see here it is midnight, here it is 01:00 a.m., here it is 02:00 a.m. If we look at the started-at time, it is equal to the current time. So it is 01:27 p.m. Here it is again 01:27, then 01:28, and so on. I hope this example helped you understand this concept better. Do try playing around with catchup, with different start dates and schedule intervals, and see what happens. The next thing that I would like to talk about is good practices. So, if you’re using catchup, please make sure that you have static dates in your DAG.
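One commonly cited reason for this advice can be sketched with the same start date plus interval rule from earlier. This is a simplified model and an assumption about the reasoning, since the recording stops here. The helper name `first_run_is_due` and the year 2020 are hypothetical.

```python
from datetime import datetime, timedelta


def first_run_is_due(start_date, interval, now):
    """A first run can be created only once start_date + interval has passed."""
    return start_date + interval <= now


interval = timedelta(hours=1)
static_start = datetime(2020, 7, 15)  # fixed start date; the year is assumed

# Pretend the scheduler re-reads the DAG file at these three moments.
parse_times = [datetime(2020, 7, 15, hour) for hour in (2, 6, 12)]

static_results = [first_run_is_due(static_start, interval, now) for now in parse_times]

# A dynamic start date such as datetime.now() is re-evaluated on every parse,
# so it always equals "now" and the first interval never finishes.
dynamic_results = [first_run_is_due(now, interval, now) for now in parse_times]

print(static_results)   # [True, True, True]
print(dynamic_results)  # [False, False, False]
```

With a fixed date, the set of intervals to catch up on is well defined. With a start date that moves every time the file is parsed, it is not.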