Apache Airflow is a workflow management platform for authoring workflows as Directed Acyclic Graphs (DAGs) of tasks. This makes it easier to build data pipelines, monitor them, and perform ETL operations. Even a simple machine learning task may involve a complex data pipeline, and triggering and monitoring such pipelines manually adds unnecessary overhead and invites errors.
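To make the DAG idea concrete, here is a minimal sketch in plain Python. It does not use the Airflow API; the task names and the execution_order helper are hypothetical and only illustrate how declared dependencies determine a valid run order, and why a cycle makes a workflow impossible to schedule.

```python
from collections import deque

# Hypothetical pipeline: each task maps to the tasks it depends on.
PIPELINE = {
    "extract": [],
    "transform": ["extract"],
    "train_model": ["transform"],
    "report": ["transform"],
}


def execution_order(dependencies):
    """Return task names in an order that respects every dependency (Kahn's algorithm)."""
    # Number of unfinished upstream tasks for each task.
    remaining = {task: len(upstream) for task, upstream in dependencies.items()}
    # Reverse mapping: which tasks are waiting on a given task.
    downstream = {task: [] for task in dependencies}
    for task, upstream in dependencies.items():
        for parent in upstream:
            downstream[parent].append(task)

    # Tasks with no dependencies can run first; sorting keeps the order deterministic.
    ready = deque(sorted(task for task, count in remaining.items() if count == 0))
    order = []
    while ready:
        task = ready.popleft()
        order.append(task)
        for child in sorted(downstream[task]):
            remaining[child] -= 1
            if remaining[child] == 0:
                ready.append(child)

    # If some tasks were never released, the dependencies loop back on themselves.
    if len(order) != len(dependencies):
        raise ValueError("dependencies contain a cycle, so this is not a DAG")
    return order


print(execution_order(PIPELINE))
```

In this sketch "train_model" and "report" both wait for "transform", which in turn waits for "extract". A scheduler such as Airflow applies the same principle, with scheduling, retries, and monitoring added on top.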
Qubole offers Airflow running on top of the Anaconda environment to simplify running machine learning pipelines and data science tasks. Anaconda is an open source Python distribution for data science, machine learning, and large-scale data processing, with over 1,400 packages. Users can therefore run large data pipelines with broad package support for their tasks. Qubole also offers Package Management, which lets users install Anaconda packages on their clusters directly from the UI without restarting the clusters.
In short, running Airflow on the Anaconda environment lets users build complex data pipelines for machine learning and data science tasks, and Qubole’s package management feature lets them add packages optimized for those tasks while the cluster keeps running.
How to Run Airflow on Anaconda with Qubole
Step 1: Creating a cluster
- From the Clusters page, select an Airflow cluster with the Python version set to 3.5. Qubole automatically attaches the cluster to an Anaconda environment.
- A new Airflow cluster is created and is ready to use.
Step 2: Adding packages
- You can install Python packages on the cluster from the Qubole Environments page without restarting the cluster. Open the page and select your cluster.
- Add the packages you require. Each selected package is installed in the Anaconda environment.
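After adding packages, it can be useful to confirm that they are importable before a pipeline depends on them. The following is a minimal sketch using only the Python standard library; the missing_packages helper and the package names are hypothetical placeholders and not part of Qubole's tooling.

```python
import importlib.util


def missing_packages(names):
    """Return the top-level module names that cannot be imported in this environment."""
    # find_spec returns None when a top-level module is not installed.
    return [name for name in names if importlib.util.find_spec(name) is None]


# Hypothetical requirements; replace with the import names your pipeline uses.
required = ["numpy", "pandas"]
missing = missing_packages(required)
if missing:
    print("Not installed: " + ", ".join(missing))
else:
    print("All required packages are available.")
```

Note that the check uses import names, which can differ from package names (for example, the scikit-learn package is imported as sklearn).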
Step 3: Running shell commands on the cluster
- Qubole lets you run shell commands on the cluster directly from the Analyze page.
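One use of a shell command is to confirm which Python interpreter the cluster actually runs. The sketch below assumes the script is saved under a hypothetical name such as check_env.py and started with a command like "python check_env.py"; how the file reaches the cluster is not covered here. It avoids f-strings so that it also runs on Python 3.5.

```python
import platform
import sys


def environment_summary():
    """Collect basic facts about the interpreter that runs this script."""
    return {
        "python_version": platform.python_version(),
        "executable": sys.executable,  # path of the running interpreter
        "prefix": sys.prefix,          # root directory of the active environment
    }


for key, value in sorted(environment_summary().items()):
    print("{}: {}".format(key, value))
```

If the cluster is attached to an Anaconda environment, the executable and prefix paths would be expected to point inside that environment, though the exact location depends on the installation.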
The steps above show how Qubole simplifies building data pipelines. You can now build, train, and deploy machine learning and data science pipelines on top of the Anaconda environment, with package management available to install the packages they need.