In an earlier blog post, we discussed the availability of Jupyter-based Notebooks for machine learning (ML) and analytics with a host of features that make JupyterLab easy to use with optimized Apache Spark on Qubole.
Building on the same theme, we have implemented the next set of features, which integrate Notebooks more closely with core components of the Qubole platform such as the Scheduler, access controls, and data discovery. In this post, we discuss how these enhancements further enrich the data science experience on Qubole.
1. Highly available Notebook interface
The new offering comes with an always-available interface for interacting with Notebooks. The Notebooks service can be used without starting a cluster: Notebook contents can be viewed and edited while clusters are down, and a running cluster is required only to execute a Notebook.
To run a Notebook, users select the desired cluster from the top right corner of the Notebook’s toolbar. Because the Notebook service is detached from the cluster, users can switch a Notebook to a different cluster without reloading the interface. The image below shows the cluster dropdown from which the cluster can be selected or changed.
To make the Notebooks service available without a running cluster, we serve the Notebook interface from a dedicated, scalable tier based on Kubernetes (k8s). Because this tier is separate from the clusters, the interface stays available regardless of cluster state, and the additional features described below build on it.
2. Scheduling Notebooks
Our customers use Qubole’s Scheduler extensively to create and manage schedules for their workloads, and with the new release they can use it with Notebooks as well. Notebooks can be scheduled with the Scheduler through the user interface (UI) or the application programming interface (API).
As shown in the screenshot below, we have introduced the Jupyter Notebook Command in the command selection dropdown. Using this new command type, users can schedule a Notebook to run periodically on any cluster.
2.1 Scheduler Widget
The Scheduler widget integrates Scheduler support directly into Notebooks. As shown in the screenshot, it provides the following functionality:
- Create a new schedule for a Notebook.
- View existing schedules associated with a Notebook.
- View the run history of a scheduled Notebook.
- View the output of a past run by double-clicking its entry in the run history.
- Share a link to the output with other users, subject to role-based access controls (RBAC).
3. Parameterized Notebooks
Notebooks support parameterization, which can be used from the Scheduler interface and through API execution. Parameterization enables new ways to use Notebooks. For example, a Notebook that produces financial results can be run for different dates. We use the open-source tool Papermill for this purpose.
The general workflow is to designate one cell in the Notebook as the “Parameters cell” from the Notebook UI by marking it with a “Parameters” tag, as shown below. This cell holds the default values of the parameters.
Users can override these parameters during execution by passing them as arguments from the Scheduler interface or as parameters in the API payload.
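The override mechanism can be illustrated with a minimal sketch that uses only the Python standard library. It mimics the general idea behind Papermill: locate the cell tagged `parameters`, then place a cell holding the overriding values right after it so that they take precedence over the defaults. It is not Papermill's actual code or Qubole's implementation; the function `inject_parameters`, the sample notebook, and the parameter name `report_date` are hypothetical.

```python
import copy


def inject_parameters(notebook, overrides):
    """Return a copy of `notebook` with override assignments placed
    right after the cell tagged "parameters" (illustrative only)."""
    nb = copy.deepcopy(notebook)
    cells = nb["cells"]
    # Find the first cell whose metadata tags include "parameters".
    index = next(
        (i for i, cell in enumerate(cells)
         if "parameters" in cell.get("metadata", {}).get("tags", [])),
        None,
    )
    if index is None:
        raise ValueError("no cell is tagged 'parameters'")
    # One assignment per override; repr() yields a valid Python literal
    # for simple values such as strings, numbers, and booleans.
    source = "\n".join(f"{name} = {value!r}" for name, value in overrides.items())
    injected = {
        "cell_type": "code",
        "metadata": {"tags": ["injected-parameters"]},
        "source": source,
        "outputs": [],
        "execution_count": None,
    }
    cells.insert(index + 1, injected)
    return nb


# Hypothetical notebook in the Jupyter JSON layout (nbformat 4).
notebook = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {"cell_type": "code", "metadata": {"tags": ["parameters"]},
         "source": "report_date = '2020-01-01'",
         "outputs": [], "execution_count": None},
        {"cell_type": "code", "metadata": {},
         "source": "print(report_date)",
         "outputs": [], "execution_count": None},
    ],
}

result = inject_parameters(notebook, {"report_date": "2020-03-31"})
for cell in result["cells"]:
    print(cell["source"])
```

Because the injected cell runs after the defaults, the later assignment wins when the Notebook executes. When using Papermill directly, the equivalent step is a call such as `papermill.execute_notebook(input_path, output_path, parameters={...})`.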
4. Access control
We now support access control for Jupyter Notebooks. This lets users grant or restrict access at per-Notebook granularity.
A new Resource called Jupyter Notebook has been added in the Manage Roles section in the QDS Control Panel. Using this, administrators and product owners within a team can control which users in an account can have access to Notebooks, as shown in the screenshot below.
While resource-level permissions can be used to control complete access to the Notebooks, the owner (creator of the Notebook) can override these permissions at the individual Notebook level by setting an appropriate access policy.
Users with appropriate permissions can right-click a Notebook and select Manage Permissions, as shown in the screenshot below.
Users can also set resource-level permissions (Read and Write) and access policies (Read, Write and Manage) on folders using the File Browser. A policy can be set only on the first-level folders under the user’s home and Common directories.
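The relationship between resource-level permissions and a Notebook-level access policy can be sketched roughly as follows. This is a simplified illustration under one assumption: a policy entry that names a user takes precedence over that user's resource-level permissions. The actual evaluation order in QDS may differ, and the function `is_allowed` and the sample users are hypothetical.

```python
from typing import Dict, Optional, Set


def is_allowed(
    user: str,
    action: str,
    resource_permissions: Dict[str, Set[str]],
    notebook_policy: Optional[Dict[str, Set[str]]] = None,
) -> bool:
    """Check `action` for `user` (assumed precedence, not the QDS code).

    A Notebook-level policy entry for the user, when present, overrides
    the resource-level permissions granted through the user's role.
    """
    if notebook_policy is not None and user in notebook_policy:
        return action in notebook_policy[user]
    return action in resource_permissions.get(user, set())


# Resource-level permissions, as they might be set in Manage Roles.
resource_permissions = {"alice": {"read", "write"}, "bob": {"read"}}

# Policy set by the Notebook owner for one specific Notebook.
notebook_policy = {"alice": {"read"}, "bob": {"read", "write"}}

bob_without_policy = is_allowed("bob", "write", resource_permissions)
bob_with_policy = is_allowed("bob", "write", resource_permissions, notebook_policy)
alice_with_policy = is_allowed("alice", "write", resource_permissions, notebook_policy)
print(bob_without_policy, bob_with_policy, alice_with_policy)
```

In this sketch the owner's policy both widens access (bob gains write on this Notebook) and narrows it (alice loses write), which is the kind of per-Notebook override described above.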
5. Data Discovery
We are introducing two new widgets in the left sidebar of the Notebook interface.
- Table Explorer: lets the user explore datasets by showing schemas, tables, and table columns. It supports Hive tables on all supported cloud platforms, as well as Google BigQuery on Google Cloud.
- Object Storage Explorer: gives users a quick way to view buckets, folders, and files in S3 on AWS, Blob Storage on Azure, and Google Cloud Storage on Google Cloud.
The enhancements to Qubole’s Jupyter Notebooks provide an open, simple experience that is integrated with the rest of the Qubole platform, giving users an improved data science experience. The new Notebook interface is currently in closed beta. Please contact our support team if you wish to enable it in your account.